<<<<<<< HEAD Assignment 2

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.0.5
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.0.5
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(PerformanceAnalytics)
## Warning: package 'PerformanceAnalytics' was built under R version 4.0.5
## Loading required package: xts
## Warning: package 'xts' was built under R version 4.0.5
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
library(corrplot)
## corrplot 0.90 loaded
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.0.5
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.0     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## v purrr   0.3.4
## Warning: package 'readr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter()    masks stats::filter()
## x xts::first()       masks dplyr::first()
## x dplyr::lag()       masks stats::lag()
## x xts::last()        masks dplyr::last()
## x Hmisc::src()       masks dplyr::src()
## x Hmisc::summarize() masks dplyr::summarize()
library(datasets)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
#Get your current working directory in R using **getwd ()** function
#getwd()
dir=getwd()
#Set your working  directory in R using **setwd ()** function
#setwd ()
setwd(dir)

#Task 1 Missing values

dat <- read.csv("Movie.csv", header=TRUE)
names(dat)
##  [1] "Color"                     "Director"                 
##  [3] "Reviews"                   "Duration"                 
##  [5] "Director_facebook_likes"   "Actor_3_facebook_likes"   
##  [7] "Actor_2_name"              "Actor_1_facebook_likes"   
##  [9] "Gross"                     "Genre"                    
## [11] "Actor_1_name"              "Title"                    
## [13] "Votes"                     "Cast_total_facebook_likes"
## [15] "Actor_3_name"              "Facenumber_in_poster"     
## [17] "Plot_keywords"             "Movie_imdb_link"          
## [19] "Language"                  "Content_rating"           
## [21] "Budget"                    "Year"                     
## [23] "Actor_2_facebook_likes"    "Imdb_score"               
## [25] "Aspect_ratio"              "Movie_facebook_likes"
str(dat)
## 'data.frame':    3891 obs. of  26 variables:
##  $ Color                    : chr  "Color" "Color" "Color" "Color" ...
##  $ Director                 : chr  "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
##  $ Reviews                  : int  723 302 602 813 462 392 324 635 375 673 ...
##  $ Duration                 : int  178 169 148 164 132 156 100 141 153 183 ...
##  $ Director_facebook_likes  : int  0 563 0 22000 475 0 15 0 282 0 ...
##  $ Actor_3_facebook_likes   : int  855 1000 161 23000 530 4000 284 19000 10000 2000 ...
##  $ Actor_2_name             : chr  "Joel David Moore" "Orlando Bloom" "Rory Kinnear" "Christian Bale" ...
##  $ Actor_1_facebook_likes   : int  1000 40000 11000 27000 640 24000 799 26000 25000 15000 ...
##  $ Gross                    : int  760505847 309404152 200074175 448130642 73058679 336530303 200807262 458991599 301956980 330249062 ...
##  $ Genre                    : chr  "Action" "Action" "Action" "Action" ...
##  $ Actor_1_name             : chr  "CCH Pounder" "Johnny Depp" "Christoph Waltz" "Tom Hardy" ...
##  $ Title                    : chr  "Avatara" "Pirates of the Caribbean: At World's Enda" "Spectrea" "The Dark Knight Risesa" ...
##  $ Votes                    : int  886204 471220 275868 1144337 212204 383056 294810 462669 321795 371639 ...
##  $ Cast_total_facebook_likes: int  2791 46563 11554 95000 2277 39000 1651 66000 46282 21000 ...
##  $ Actor_3_name             : chr  "Wes Studi" "Jack Davenport" "Stephanie Sigman" "Joseph Gordon-Levitt" ...
##  $ Facenumber_in_poster     : int  0 0 1 0 1 0 1 4 3 0 ...
##  $ Plot_keywords            : chr  "avatar|future|marine|native|paraplegic" "goddess|marriage ceremony|marriage proposal|pirate|singapore" "bomb|espionage|sequel|spy|terrorist" "deception|imprisonment|lawlessness|police officer|terrorist plot" ...
##  $ Movie_imdb_link          : chr  "http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1" ...
##  $ Language                 : chr  "English" "English" "English" "English" ...
##  $ Content_rating           : chr  "PG-13" "" "PG-13" "PG-13" ...
##  $ Budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 2.64e+08 ...
##  $ Year                     : int  2009 2007 2015 2012 2012 2007 2010 2015 2009 2016 ...
##  $ Actor_2_facebook_likes   : int  936 5000 393 23000 632 11000 553 21000 11000 4000 ...
##  $ Imdb_score               : num  7.9 7.1 6.8 8.5 6.6 6.2 7.8 7.5 7.5 6.9 ...
##  $ Aspect_ratio             : num  1.78 2.35 2.35 2.35 2.35 2.35 1.85 2.35 2.35 2.35 ...
##  $ Movie_facebook_likes     : int  33000 0 85000 164000 24000 0 29000 118000 10000 197000 ...
sum(is.na(dat)) #Total amount of NAs before replacing blank cells with NAs
## [1] 101
#replace blank with NA
dat[dat == ""] <- NA
sum(is.na(dat)) #Total amount of NAs after replacing blank cells with NAs
## [1] 226

Number of NA in each column

colnames(dat)[colSums(is.na(dat)) > 0] #Names of columns that have NA
##  [1] "Color"                  "Reviews"                "Duration"              
##  [4] "Actor_3_facebook_likes" "Actor_2_name"           "Actor_1_facebook_likes"
##  [7] "Actor_1_name"           "Actor_3_name"           "Facenumber_in_poster"  
## [10] "Plot_keywords"          "Content_rating"         "Actor_2_facebook_likes"
## [13] "Aspect_ratio"
col_NA <- colSums(is.na(dat)) #The amount of NA each column has
col_NA
##                     Color                  Director                   Reviews 
##                         2                         0                         1 
##                  Duration   Director_facebook_likes    Actor_3_facebook_likes 
##                         1                         0                        10 
##              Actor_2_name    Actor_1_facebook_likes                     Gross 
##                         5                         3                         0 
##                     Genre              Actor_1_name                     Title 
##                         0                         3                         0 
##                     Votes Cast_total_facebook_likes              Actor_3_name 
##                         0                         0                        10 
##      Facenumber_in_poster             Plot_keywords           Movie_imdb_link 
##                         6                        31                         0 
##                  Language            Content_rating                    Budget 
##                         0                        74                         0 
##                      Year    Actor_2_facebook_likes                Imdb_score 
##                         0                         5                         0 
##              Aspect_ratio      Movie_facebook_likes 
##                        75                         0
class(col_NA)
## [1] "numeric"

Look at how many NAs for columns that have more than 0 NA and calculate the % of NA per variable.

for (e in 1:ncol(dat)){
  if (colSums(is.na(dat[,e, drop = FALSE]))>0) {print(paste("Column ", colnames(dat)[e], " has ", colSums(is.na(dat[,e, drop = FALSE])), " NA ", " which is only ","(", colSums(is.na(dat[,e, drop = FALSE])),"/3891)*100 = ", round((colSums(is.na(dat[,e, drop = FALSE]))/3891)*100, digits = 2), "% of the variable.", sep = ""))
    }
}
## [1] "Column Color has 2 NA  which is only (2/3891)*100 = 0.05% of the variable."
## [1] "Column Reviews has 1 NA  which is only (1/3891)*100 = 0.03% of the variable."
## [1] "Column Duration has 1 NA  which is only (1/3891)*100 = 0.03% of the variable."
## [1] "Column Actor_3_facebook_likes has 10 NA  which is only (10/3891)*100 = 0.26% of the variable."
## [1] "Column Actor_2_name has 5 NA  which is only (5/3891)*100 = 0.13% of the variable."
## [1] "Column Actor_1_facebook_likes has 3 NA  which is only (3/3891)*100 = 0.08% of the variable."
## [1] "Column Actor_1_name has 3 NA  which is only (3/3891)*100 = 0.08% of the variable."
## [1] "Column Actor_3_name has 10 NA  which is only (10/3891)*100 = 0.26% of the variable."
## [1] "Column Facenumber_in_poster has 6 NA  which is only (6/3891)*100 = 0.15% of the variable."
## [1] "Column Plot_keywords has 31 NA  which is only (31/3891)*100 = 0.8% of the variable."
## [1] "Column Content_rating has 74 NA  which is only (74/3891)*100 = 1.9% of the variable."
## [1] "Column Actor_2_facebook_likes has 5 NA  which is only (5/3891)*100 = 0.13% of the variable."
## [1] "Column Aspect_ratio has 75 NA  which is only (75/3891)*100 = 1.93% of the variable."

The amount of NAs in the columns that have NAs is minuscule. Except for column Aspect_ratio and Content_rating that has significantly higher NAs than other columns. However, at 1.9% NAs, it is still very small. Therefore no columns should be deleted due to the amount of NAs.

Look at how many NAs for rows that have more than 0 NA.

a <-0
for (e in 1:nrow(dat)){
  if (rowSums(is.na(dat[e,]))>0) {print(paste("observation", e, "has", rowSums(is.na(dat[e,])), "NA"))
    a <- a + 1
    }
}
## [1] "observation 2 has 1 NA"
## [1] "observation 18 has 1 NA"
## [1] "observation 53 has 1 NA"
## [1] "observation 87 has 1 NA"
## [1] "observation 96 has 1 NA"
## [1] "observation 125 has 1 NA"
## [1] "observation 153 has 1 NA"
## [1] "observation 164 has 1 NA"
## [1] "observation 198 has 1 NA"
## [1] "observation 237 has 1 NA"
## [1] "observation 240 has 1 NA"
## [1] "observation 265 has 1 NA"
## [1] "observation 272 has 1 NA"
## [1] "observation 306 has 1 NA"
## [1] "observation 360 has 1 NA"
## [1] "observation 370 has 1 NA"
## [1] "observation 408 has 1 NA"
## [1] "observation 443 has 1 NA"
## [1] "observation 494 has 1 NA"
## [1] "observation 496 has 1 NA"
## [1] "observation 542 has 1 NA"
## [1] "observation 557 has 1 NA"
## [1] "observation 561 has 1 NA"
## [1] "observation 599 has 1 NA"
## [1] "observation 673 has 1 NA"
## [1] "observation 907 has 1 NA"
## [1] "observation 1014 has 1 NA"
## [1] "observation 1260 has 1 NA"
## [1] "observation 1438 has 2 NA"
## [1] "observation 1667 has 2 NA"
## [1] "observation 1816 has 2 NA"
## [1] "observation 1869 has 1 NA"
## [1] "observation 1897 has 1 NA"
## [1] "observation 2023 has 1 NA"
## [1] "observation 2074 has 1 NA"
## [1] "observation 2137 has 1 NA"
## [1] "observation 2159 has 1 NA"
## [1] "observation 2164 has 1 NA"
## [1] "observation 2309 has 1 NA"
## [1] "observation 2341 has 1 NA"
## [1] "observation 2593 has 1 NA"
## [1] "observation 2619 has 3 NA"
## [1] "observation 2721 has 1 NA"
## [1] "observation 2722 has 1 NA"
## [1] "observation 2725 has 1 NA"
## [1] "observation 2773 has 1 NA"
## [1] "observation 2860 has 1 NA"
## [1] "observation 2890 has 1 NA"
## [1] "observation 2897 has 2 NA"
## [1] "observation 2905 has 1 NA"
## [1] "observation 2914 has 2 NA"
## [1] "observation 2959 has 4 NA"
## [1] "observation 3014 has 1 NA"
## [1] "observation 3028 has 1 NA"
## [1] "observation 3030 has 2 NA"
## [1] "observation 3034 has 2 NA"
## [1] "observation 3070 has 1 NA"
## [1] "observation 3079 has 1 NA"
## [1] "observation 3093 has 2 NA"
## [1] "observation 3101 has 1 NA"
## [1] "observation 3123 has 1 NA"
## [1] "observation 3156 has 3 NA"
## [1] "observation 3176 has 1 NA"
## [1] "observation 3182 has 1 NA"
## [1] "observation 3255 has 1 NA"
## [1] "observation 3257 has 2 NA"
## [1] "observation 3266 has 1 NA"
## [1] "observation 3280 has 1 NA"
## [1] "observation 3285 has 1 NA"
## [1] "observation 3288 has 1 NA"
## [1] "observation 3290 has 1 NA"
## [1] "observation 3306 has 1 NA"
## [1] "observation 3315 has 2 NA"
## [1] "observation 3318 has 1 NA"
## [1] "observation 3364 has 1 NA"
## [1] "observation 3367 has 1 NA"
## [1] "observation 3373 has 1 NA"
## [1] "observation 3401 has 1 NA"
## [1] "observation 3402 has 3 NA"
## [1] "observation 3411 has 1 NA"
## [1] "observation 3447 has 1 NA"
## [1] "observation 3453 has 2 NA"
## [1] "observation 3454 has 1 NA"
## [1] "observation 3459 has 2 NA"
## [1] "observation 3464 has 1 NA"
## [1] "observation 3468 has 1 NA"
## [1] "observation 3480 has 1 NA"
## [1] "observation 3496 has 2 NA"
## [1] "observation 3499 has 2 NA"
## [1] "observation 3511 has 2 NA"
## [1] "observation 3515 has 1 NA"
## [1] "observation 3518 has 1 NA"
## [1] "observation 3544 has 1 NA"
## [1] "observation 3553 has 1 NA"
## [1] "observation 3572 has 1 NA"
## [1] "observation 3573 has 1 NA"
## [1] "observation 3582 has 1 NA"
## [1] "observation 3584 has 1 NA"
## [1] "observation 3585 has 4 NA"
## [1] "observation 3591 has 1 NA"
## [1] "observation 3607 has 1 NA"
## [1] "observation 3608 has 2 NA"
## [1] "observation 3609 has 1 NA"
## [1] "observation 3622 has 1 NA"
## [1] "observation 3625 has 1 NA"
## [1] "observation 3627 has 1 NA"
## [1] "observation 3631 has 2 NA"
## [1] "observation 3643 has 7 NA"
## [1] "observation 3644 has 2 NA"
## [1] "observation 3645 has 3 NA"
## [1] "observation 3648 has 1 NA"
## [1] "observation 3650 has 1 NA"
## [1] "observation 3681 has 1 NA"
## [1] "observation 3686 has 1 NA"
## [1] "observation 3694 has 1 NA"
## [1] "observation 3696 has 2 NA"
## [1] "observation 3714 has 1 NA"
## [1] "observation 3724 has 2 NA"
## [1] "observation 3726 has 1 NA"
## [1] "observation 3740 has 1 NA"
## [1] "observation 3741 has 1 NA"
## [1] "observation 3746 has 2 NA"
## [1] "observation 3749 has 8 NA"
## [1] "observation 3757 has 1 NA"
## [1] "observation 3761 has 1 NA"
## [1] "observation 3768 has 2 NA"
## [1] "observation 3770 has 1 NA"
## [1] "observation 3777 has 1 NA"
## [1] "observation 3780 has 2 NA"
## [1] "observation 3788 has 1 NA"
## [1] "observation 3794 has 1 NA"
## [1] "observation 3796 has 1 NA"
## [1] "observation 3799 has 1 NA"
## [1] "observation 3802 has 6 NA"
## [1] "observation 3812 has 7 NA"
## [1] "observation 3813 has 1 NA"
## [1] "observation 3814 has 1 NA"
## [1] "observation 3826 has 1 NA"
## [1] "observation 3830 has 1 NA"
## [1] "observation 3832 has 1 NA"
## [1] "observation 3838 has 1 NA"
## [1] "observation 3839 has 1 NA"
## [1] "observation 3840 has 1 NA"
## [1] "observation 3841 has 2 NA"
## [1] "observation 3842 has 1 NA"
## [1] "observation 3852 has 2 NA"
## [1] "observation 3855 has 3 NA"
## [1] "observation 3856 has 3 NA"
## [1] "observation 3858 has 1 NA"
## [1] "observation 3861 has 1 NA"
## [1] "observation 3875 has 1 NA"
## [1] "observation 3877 has 2 NA"
## [1] "observation 3881 has 1 NA"
## [1] "observation 3882 has 2 NA"
## [1] "observation 3886 has 1 NA"
## [1] "observation 3888 has 1 NA"
## [1] "observation 3890 has 1 NA"
print(paste(a,"out of 3891 obversations has NA", ". Which is only", round((a/3891)*100,digits = 2), "% of the data"))
## [1] "157 out of 3891 obversations has NA . Which is only 4.03 % of the data"

There are 26 columns/variables. The amount of NAs for observations that have NAs is very small (ranging from 1-8 NAs). Since there are 3891 observations in this dataset, we can safely remove all variables that has NAs which is only about 4% of the data set without introducing bias.

dat.cleaned <- na.omit(dat)
any(is.na(dat.cleaned))
## [1] FALSE
dim(dat.cleaned)
## [1] 3734   26

After removing 157 observations with NAs, there are now 3734 obs left which is still a large amount of dataset for unbias correlation exploration.

#Task 2 Exploration: #Based on the Dataset, calculate “Profit” and determine the relationship between “Profit” and other variables (e.g. IMDB score). Hint: Profit = Gross – Budget. Use line plot or scatter plot to find the relationship.

There are no NAs in variable gross nor budget so therefore we can create a column profit as usually.

dat2 <- dat.cleaned
dat2$Profit <- dat2$Gross - dat2$Budget
head(dat2)
##   Color          Director Reviews Duration Director_facebook_likes
## 1 Color     James Cameron     723      178                       0
## 3 Color        Sam Mendes     602      148                       0
## 4 Color Christopher Nolan     813      164                   22000
## 5 Color    Andrew Stanton     462      132                     475
## 6 Color         Sam Raimi     392      156                       0
## 7 Color      Nathan Greno     324      100                      15
##   Actor_3_facebook_likes     Actor_2_name Actor_1_facebook_likes     Gross
## 1                    855 Joel David Moore                   1000 760505847
## 3                    161     Rory Kinnear                  11000 200074175
## 4                  23000   Christian Bale                  27000 448130642
## 5                    530  Samantha Morton                    640  73058679
## 6                   4000     James Franco                  24000 336530303
## 7                    284     Donna Murphy                    799 200807262
##       Genre    Actor_1_name                  Title   Votes
## 1    Action     CCH Pounder                Avatara  886204
## 3    Action Christoph Waltz               Spectrea  275868
## 4    Action       Tom Hardy The Dark Knight Risesa 1144337
## 5    Action    Daryl Sabara           John Cartera  212204
## 6    Action    J.K. Simmons          Spider-Man 3a  383056
## 7 Adventure    Brad Garrett               Tangleda  294810
##   Cast_total_facebook_likes         Actor_3_name Facenumber_in_poster
## 1                      2791            Wes Studi                    0
## 3                     11554     Stephanie Sigman                    1
## 4                     95000 Joseph Gordon-Levitt                    0
## 5                      2277         Polly Walker                    1
## 6                     39000        Kirsten Dunst                    0
## 7                      1651          M.C. Gainey                    1
##                                                      Plot_keywords
## 1                           avatar|future|marine|native|paraplegic
## 3                              bomb|espionage|sequel|spy|terrorist
## 4 deception|imprisonment|lawlessness|police officer|terrorist plot
## 5               alien|american civil war|male nipple|mars|princess
## 6                        sandman|spider man|symbiote|venom|villain
## 7             17th century|based on fairy tale|disney|flower|tower
##                                        Movie_imdb_link Language Content_rating
## 1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1  English          PG-13
## 3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1  English          PG-13
## 4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1  English          PG-13
## 5 http://www.imdb.com/title/tt0401729/?ref_=fn_tt_tt_1  English          PG-13
## 6 http://www.imdb.com/title/tt0413300/?ref_=fn_tt_tt_1  English          PG-13
## 7 http://www.imdb.com/title/tt0398286/?ref_=fn_tt_tt_1  English             PG
##      Budget Year Actor_2_facebook_likes Imdb_score Aspect_ratio
## 1 237000000 2009                    936        7.9         1.78
## 3 245000000 2015                    393        6.8         2.35
## 4 250000000 2012                  23000        8.5         2.35
## 5 263700000 2012                    632        6.6         2.35
## 6 258000000 2007                  11000        6.2         2.35
## 7 260000000 2010                    553        7.8         1.85
##   Movie_facebook_likes     Profit
## 1                33000  523505847
## 3                85000  -44925825
## 4               164000  198130642
## 5                24000 -190641321
## 6                    0   78530303
## 7                29000  -59192738

We will explore the relationship between profit and other comparable variables such as: Movie_facebook_likes, Aspect_ratio, Imdb_score, Actor_2_facebook_likes, Year, Budget, Content_rating, Facenumber_in_poster, Cast_total_facebook_likes, Votes, Genre, Gross, Actor_1_facebook_likes, Actor_3_facebook_likes, Director_facebook_likes, Duration, Reviews, Color.

#Profit vs Imdb_score
ggplot(data = dat2, mapping = aes(y = Profit, x = Imdb_score)) + geom_point(color = 'darkblue', alpha=1/5)

Since there are a few outliners of movies that made losses, it has made the scatter plot scale spread out and hard to visualise the correlation. For example, according to the plot, there was a movie that lost 12 billions dollars. It is questionable if this movie really made a loss of 12 billions dollars or was it a data error. Therefore, a new dataframe will be created to only take entries for only movies that made more than -200 millions dollars.

dat_profit <- dat2[dat2$Profit >= -2e8,]
head(dat_profit)
##   Color          Director Reviews Duration Director_facebook_likes
## 1 Color     James Cameron     723      178                       0
## 3 Color        Sam Mendes     602      148                       0
## 4 Color Christopher Nolan     813      164                   22000
## 5 Color    Andrew Stanton     462      132                     475
## 6 Color         Sam Raimi     392      156                       0
## 7 Color      Nathan Greno     324      100                      15
##   Actor_3_facebook_likes     Actor_2_name Actor_1_facebook_likes     Gross
## 1                    855 Joel David Moore                   1000 760505847
## 3                    161     Rory Kinnear                  11000 200074175
## 4                  23000   Christian Bale                  27000 448130642
## 5                    530  Samantha Morton                    640  73058679
## 6                   4000     James Franco                  24000 336530303
## 7                    284     Donna Murphy                    799 200807262
##       Genre    Actor_1_name                  Title   Votes
## 1    Action     CCH Pounder                Avatara  886204
## 3    Action Christoph Waltz               Spectrea  275868
## 4    Action       Tom Hardy The Dark Knight Risesa 1144337
## 5    Action    Daryl Sabara           John Cartera  212204
## 6    Action    J.K. Simmons          Spider-Man 3a  383056
## 7 Adventure    Brad Garrett               Tangleda  294810
##   Cast_total_facebook_likes         Actor_3_name Facenumber_in_poster
## 1                      2791            Wes Studi                    0
## 3                     11554     Stephanie Sigman                    1
## 4                     95000 Joseph Gordon-Levitt                    0
## 5                      2277         Polly Walker                    1
## 6                     39000        Kirsten Dunst                    0
## 7                      1651          M.C. Gainey                    1
##                                                      Plot_keywords
## 1                           avatar|future|marine|native|paraplegic
## 3                              bomb|espionage|sequel|spy|terrorist
## 4 deception|imprisonment|lawlessness|police officer|terrorist plot
## 5               alien|american civil war|male nipple|mars|princess
## 6                        sandman|spider man|symbiote|venom|villain
## 7             17th century|based on fairy tale|disney|flower|tower
##                                        Movie_imdb_link Language Content_rating
## 1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1  English          PG-13
## 3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1  English          PG-13
## 4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1  English          PG-13
## 5 http://www.imdb.com/title/tt0401729/?ref_=fn_tt_tt_1  English          PG-13
## 6 http://www.imdb.com/title/tt0413300/?ref_=fn_tt_tt_1  English          PG-13
## 7 http://www.imdb.com/title/tt0398286/?ref_=fn_tt_tt_1  English             PG
##      Budget Year Actor_2_facebook_likes Imdb_score Aspect_ratio
## 1 237000000 2009                    936        7.9         1.78
## 3 245000000 2015                    393        6.8         2.35
## 4 250000000 2012                  23000        8.5         2.35
## 5 263700000 2012                    632        6.6         2.35
## 6 258000000 2007                  11000        6.2         2.35
## 7 260000000 2010                    553        7.8         1.85
##   Movie_facebook_likes     Profit
## 1                33000  523505847
## 3                85000  -44925825
## 4               164000  198130642
## 5                24000 -190641321
## 6                    0   78530303
## 7                29000  -59192738
ggplot(data = dat_profit, mapping = aes(y = Profit, x = Imdb_score)) + geom_point(color = 'darkblue', alpha =1/5) +geom_smooth(se=FALSE, 
              method = "lm", 
              formula = y~poly(x,1), 
              size = 1.5)

There is a positive correlation between Imdb_score and profit. As Imdb_score increases, profit is predicted to increase also.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Movie_facebook_likes)) + geom_point(color = 'darkblue', alpha =1/5)+geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a positive correlation between Movie_facebook_likes and profit. As Movie_facebook_likes increases, profit is predicted to increase also. However, there is 1 outliner where Movie_facebook_likes is really high but profit is low. If it wasn’t for this outliner, the regression line may even show a higher positive correlation.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Aspect_ratio)) + geom_point(color = 'darkblue', alpha=1/5)+geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

According to the regression line, there seems to be a slight negative correlation between profit and Aspect_ratio. However, there is an outliner of Aspect_ratio being 16 with low profit, this could have skewed the regression line. Perhaps if it wasn’t for this outliner, the regression line could have been horizontal, hence no relationship between Profit and Aspect_ratio.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Actor_2_facebook_likes)) + geom_point(color = 'darkblue', alpha =1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a positive correlation between Actor_2_facebook_likes and Profit. As Actor_2_facebook_likes increases, profit seems to also increase. There is an outliner of really high Actor_2_facebook_likes but low profit. Perhaps if it wasn’t for this outliner, the regression line would have been steeper, hence, higher positive correlation.

ggplot(data = dat_profit, mapping = aes(x = Year, y = Profit)) + geom_point(color = 'darkblue', alpha =1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

Interestingly and unexpectedly, as Year increases, Profit seems to slightly decrease. There is a negative correlation between Year and Profit. There are many new movies that make really high profit but there are also many that made losses.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Content_rating)) + geom_point(color = 'darkblue', alpha =1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

In terms of Content_rating, PG-13 and PG have movies from losses and to high profit and some outliners with really high profit. R and G are spread out from losses to medium and high profit. Not Rated seem to make more losses than profit. Almost all aprroved rating seem to make profit but they are as high PG, PG-13, G, and R. The rest of the categories seem to not have as many movies.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Budget)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There seems to be a very small positive relationship between Profit and Budget. However, since budget is used to calculate profit, there should be a negative relationship between budget and profit if we hadn’t removed the all movies with profit less than $200,000,000.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Facenumber_in_poster)) + geom_point(color = 'darkblue', alpha =1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a slight negative correlation between Facenumber_in_poster and Profit. There are also 2 outliners where Facenumber_in_poster are high but Profit is low which could have skewed the regression line a little bit. If it wasn’t for these 2 outliners, may we find that the regression line could be a little bit flatter which could mean that there is no correlation between Facenumber_in_poster and Profit.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Cast_total_facebook_likes)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a positive correlation between Cast_total_facebook_likes and Profit. It is predicted that as Cast_total_facebook_likes increases, profit increases. If it wasn’t for the outliner in the far right, the positive correlation may even be stronger.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Votes)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a positive correlation between Votes and Profit. It is predicted that as Votes increases, profit increases.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Genre)) + geom_point(color = 'darkblue', alpha =1/5) +geom_smooth(method = "lm") + theme(axis.text.x = element_text(angle = 45, hjust=1))
## `geom_smooth()` using formula 'y ~ x'

It can be said that action movies have big budget and as predicted earlier, the bigger the budget, the larger the profit. This could explain why the profit can be amongst the highest categories. Many action movies are also big losers in terms of profit as well. Other popular categories where profit ranges from really low to really high are Adventure, Animation, Comedy, Drama. Middle range categories are Biography, Crime, Horror, Fantasy, Mystery. Animation is spread out between low and high profit. Suprising Romance, Thriller, Musical, Western and Family don’t do so well and fall in the range that barely make profit.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Gross)) + geom_point(color = 'darkblue', alpha =1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

As expected, the more money you make, the higher the profit. This shows strong positive correlation correlation.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Actor_1_facebook_likes)) + geom_point(color = 'darkblue', alpha =1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a very slight positive correlation between Actor_1_facebook_likes and Profit. As Actor_1_facebook_likes increases, Profit is expected to slightly increase. If it wasn’t for the far right outliner, the regression line might even be steeper.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Actor_3_facebook_likes)) + geom_point(color = 'darkblue', alpha = 1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

Actor_1_facebook_likes is sligher more important than Actor_3_facebook_likes and a lot of actors got no fb likes. There is a positive correlation between Actor_3_facebook_likes and Profit.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Director_facebook_likes)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a very low positive correlation between Director_facebook_likes and Profit, many directors with no to little fb likes.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Duration)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a low positive correlation between Duration and Profit. The majority of movies have duration between 50 mins to 150 mins. The longer the duration of a movie doesn’t mean higher the profit.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Reviews)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a medium positive correlation between Reviews and Profit. As reviews increases, Profit is predicted to increase.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Color)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(se=FALSE, 
              method = "lm", 
              formula = y~poly(x,1), 
              size = 1.5)

ggplot(data = dat_profit, mapping = aes(x = Profit, y = ..density..)) + geom_freqpoly(mapping = aes(colour = Color), binwidth = 500000)

Generally, Colar movies can make a lot more money than Black and White but they can also lose a lot more profit as well. There are more Color movies as supposed to Black and White. In the density plot, it can be seen that a large amount of black and white movies make little to no profit.

#Calculate the correlation between the variable(s) used in the Dataset.

#Select only numeric variables for correlation matrix
dat_num <- dat2 %>% select_if(is.numeric)
dat_cor<- cor(dat_num)
round(dat_cor, 2)
##                           Reviews Duration Director_facebook_likes
## Reviews                      1.00     0.23                    0.18
## Duration                     0.23     1.00                    0.18
## Director_facebook_likes      0.18     0.18                    1.00
## Actor_3_facebook_likes       0.25     0.13                    0.12
## Actor_1_facebook_likes       0.16     0.08                    0.09
## Gross                        0.46     0.24                    0.14
## Votes                        0.59     0.34                    0.30
## Cast_total_facebook_likes    0.25     0.14                    0.27
## Facenumber_in_poster        -0.03     0.03                   -0.05
## Budget                       0.10     0.07                    0.02
## Year                         0.42    -0.13                   -0.04
## Actor_2_facebook_likes       0.25     0.13                    0.12
## Imdb_score                   0.35     0.37                    0.19
## Aspect_ratio                 0.18     0.15                    0.04
## Movie_facebook_likes         0.70     0.22                    0.16
## Profit                       0.04     0.01                    0.02
##                           Actor_3_facebook_likes Actor_1_facebook_likes Gross
## Reviews                                     0.25                   0.16  0.46
## Duration                                    0.13                   0.08  0.24
## Director_facebook_likes                     0.12                   0.09  0.14
## Actor_3_facebook_likes                      1.00                   0.25  0.30
## Actor_1_facebook_likes                      0.25                   1.00  0.14
## Gross                                       0.30                   0.14  1.00
## Votes                                       0.27                   0.18  0.62
## Cast_total_facebook_likes                   0.45                   0.94  0.23
## Facenumber_in_poster                        0.10                   0.06 -0.03
## Budget                                      0.04                   0.02  0.10
## Year                                        0.12                   0.09  0.05
## Actor_2_facebook_likes                      0.55                   0.39  0.25
## Imdb_score                                  0.06                   0.09  0.22
## Aspect_ratio                                0.05                   0.06  0.06
## Movie_facebook_likes                        0.27                   0.13  0.37
## Profit                                      0.05                   0.03  0.21
##                           Votes Cast_total_facebook_likes Facenumber_in_poster
## Reviews                    0.59                      0.25                -0.03
## Duration                   0.34                      0.14                 0.03
## Director_facebook_likes    0.30                      0.27                -0.05
## Actor_3_facebook_likes     0.27                      0.45                 0.10
## Actor_1_facebook_likes     0.18                      0.94                 0.06
## Gross                      0.62                      0.23                -0.03
## Votes                      1.00                      0.28                -0.03
## Cast_total_facebook_likes  0.28                      1.00                 0.06
## Facenumber_in_poster      -0.03                      0.06                 1.00
## Budget                     0.06                      0.03                -0.02
## Year                       0.02                      0.11                 0.07
## Actor_2_facebook_likes     0.24                      0.63                 0.07
## Imdb_score                 0.48                      0.14                -0.07
## Aspect_ratio               0.08                      0.07                 0.02
## Movie_facebook_likes       0.52                      0.21                 0.02
## Profit                     0.13                      0.04                 0.01
##                           Budget  Year Actor_2_facebook_likes Imdb_score
## Reviews                     0.10  0.42                   0.25       0.35
## Duration                    0.07 -0.13                   0.13       0.37
## Director_facebook_likes     0.02 -0.04                   0.12       0.19
## Actor_3_facebook_likes      0.04  0.12                   0.55       0.06
## Actor_1_facebook_likes      0.02  0.09                   0.39       0.09
## Gross                       0.10  0.05                   0.25       0.22
## Votes                       0.06  0.02                   0.24       0.48
## Cast_total_facebook_likes   0.03  0.11                   0.63       0.14
## Facenumber_in_poster       -0.02  0.07                   0.07      -0.07
## Budget                      1.00  0.05                   0.03       0.03
## Year                        0.05  1.00                   0.12      -0.14
## Actor_2_facebook_likes      0.03  0.12                   1.00       0.10
## Imdb_score                  0.03 -0.14                   0.10       1.00
## Aspect_ratio                0.02  0.22                   0.06       0.03
## Movie_facebook_likes        0.05  0.31                   0.23       0.28
## Profit                     -0.95 -0.03                   0.04       0.04
##                           Aspect_ratio Movie_facebook_likes Profit
## Reviews                           0.18                 0.70   0.04
## Duration                          0.15                 0.22   0.01
## Director_facebook_likes           0.04                 0.16   0.02
## Actor_3_facebook_likes            0.05                 0.27   0.05
## Actor_1_facebook_likes            0.06                 0.13   0.03
## Gross                             0.06                 0.37   0.21
## Votes                             0.08                 0.52   0.13
## Cast_total_facebook_likes         0.07                 0.21   0.04
## Facenumber_in_poster              0.02                 0.02   0.01
## Budget                            0.02                 0.05  -0.95
## Year                              0.22                 0.31  -0.03
## Actor_2_facebook_likes            0.06                 0.23   0.04
## Imdb_score                        0.03                 0.28   0.04
## Aspect_ratio                      1.00                 0.11  -0.01
## Movie_facebook_likes              0.11                 1.00   0.06
## Profit                           -0.01                 0.06   1.00
corrplot(dat_cor, method = "pie", type = "upper", tl.pos="l")

We will use dat2 to plot for correlations that don’t involve profit as one of the variables. This way we can incldue all observations.

dat_profit will be used for most correlations that have profit as one of the variables because due to outliners, it would easier to visualise when we exclude those that make less than $200 millions. However, dat2 will be used for the correlation of budget and profit and the reason will be explained below.

#Strong correlations

Actor_1_facebook_likes has a strong positive correlation with Cast_total_facebook_likes with coefficient of 0.94.

ggplot(data = dat2, mapping = aes(y = Cast_total_facebook_likes, x = Actor_1_facebook_likes)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a very strong positive correlation between Actor_1_facebook_likes and Cast_total_facebook_likes probably because Actor_1_facebook_likes is a large part of Cast_total_facebook_likes so as Actor_1_facebook_likes increases, Cast_total_facebook_likes is expected to also increase as well.

Reviews has a strong positive correlation with Movie_facebook_likes with coefficient of 0.7

ggplot(data = dat2, mapping = aes(x = Movie_facebook_likes, y = Reviews)) + geom_point(color = 'darkblue',alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

Reviews has a strong positive correlation with Movie_facebook_likes probably because if a movie has more reviews on IMDB, people will also look at that movie on fb and chances of that movie getting likes will also increase.

Reviews and Votes have positive correlation with coefficient 0.59

ggplot(data = dat2, mapping = aes(x = Votes, y = Reviews)) + geom_point(color = 'darkblue', alpha = 1/5) +geom_smooth(se=FALSE, 
              method = "lm", 
              formula = y~poly(x,1), 
              size = 1.5)

It is plausible to argue that the higher the votes are, the more reviews a movie can get because audience can have the tendency to select a movie to watch based on votes. After watching, they will review the movie.

Gross and Votes have positive correlation with coefficient being 0.62

ggplot(data = dat2, mapping = aes(x = Votes, y = Gross)) + geom_point(color = 'darkblue', alpha = 1/10) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

It could be said that the more Votes a movies, the more people would want to watch the movie and hence the higher Gross income.

Actor_2_facebook_likes and Cast_total_facebook_likes have positive correlation with coefficient being 0.63

ggplot(data = dat2, mapping = aes(x = Actor_2_facebook_likes, y = Cast_total_facebook_likes)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

Actor_2_facebook_likes contributes to Cast_total_facebook_likes so the positive correlation here is self-explanatory. However, there are 2 outliners. The first one has high Actor_2_facebook_likes but low Cast_total_facebook_likes, this could be because the actor is popular but the rest of the cast is not. The second one has high Cast_total_facebook_likes but low Actor_2_facebook_likes, this could be because the 2nd actor is not popular but the rest of the cast are more fans’ favorites.

Budget has a very strong negative correlation with Profit with coefficient of -0.95

ggplot(data = dat2, mapping = aes(x = Budget, y = Profit)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

dat2 is used here instead of dat_profit because we want to include all movies so we can get the most accurate regression line for this correlation when coefficient is -0.95. If we use dat_profit, the regression line will be slightly upwards which doesn’t represent coefficient of -0.95.

Profit is Gross - Budget so mathematically, the higher the Budget, the lower the Profit and hence the negative correlation, hence the steep downwards slope.

#Weak correlations There are many weak correlations, some of the weakly correlated pairs are:

Duration and Profit only have coefficient of 0.01

ggplot(data = dat_profit, mapping = aes(x = Duration, y = Profit)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

This is probably because the duration of the movie does not really determine how successful the movie is going to be. A long movie can have bad quality that leads to low profit.

Aspect_ratio and Profit only have coefficient of -0.01

ggplot(data = dat_profit, mapping = aes(x = Aspect_ratio, y = Profit)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

In addition to the explanation about profit and aspect_ratio above, the aspect_ratio is probably isn’t one of the main factors that make the audience fall in love with a movie, it’s probably more about the content of the movie. Therefore, aspect_ratio doesn’t have a direct impact on a movie’s profit.

Facenumber_in_poster and Profit only have coefficient of 0.01

ggplot(data = dat_profit, mapping = aes(x = Facenumber_in_poster, y = Profit)) + geom_point(color = 'darkblue',alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is probably isn’t a correlation here because the amount of faces appear in a poster doesn’t directly impact how successful a movie is. It is arguable that many bombshell movies have only 1 or 2 faces on the poster. Having too many faces on the poster could make tha poster forgetable and hence reduces the popularity of the movie which could also negatively impact a movie’s income.

Year and Votes only have coefficient of 0.01

ggplot(data = dat2, mapping = aes(x = Year, y = Votes)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

Newer movies do not necessarily have higher Votes than their older counterparts. There are many newer movies with low Votes as well. In other words, how old a movie is does not have an impact on the amount of Votes.

Budget and Actor_1_facebook_likes only have coefficient of 0.02

ggplot(data = dat2, mapping = aes(y = Actor_1_facebook_likes, x = Budget)) + geom_point(color = 'darkblue', alpha=1/5)+geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

A high budget movie does not mean that the main actor will get a lot of facebook likes given the fact that the budget is mainly used for the production of the movie, not to promote and PR the main actor’s private facebook page.

Aspect_ratio and Imdb_score only have coefficient of 0.03

ggplot(data = dat2, mapping = aes(x = Aspect_ratio, y= Imdb_score)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is an outliner of of aspect_ratio 16 that happen to have a decent Imdb_score, if it wasn’t for that, the regression line would probably be flatter. It is plausible to say that high aspect_ratio does not impact Imdb_score because Imdb_score is probably more influenced by the content, story line, quality of the movie rather than the aspect_ratio alone.

#Task 2

dat <- read.csv("Titanic.csv", header=TRUE)
names(dat)
## [1] "X"        "Class"    "Sex"      "Age"      "Survived"
str(dat)
## 'data.frame':    2201 obs. of  5 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Class   : chr  "3rd" "3rd" "3rd" "3rd" ...
##  $ Sex     : chr  "Male" "Male" "Male" "Male" ...
##  $ Age     : chr  "Child" "Child" "Child" "Child" ...
##  $ Survived: chr  "No" "No" "No" "No" ...
sum(is.na(dat)) #Total amount of NAs before replacing blank cells with NAs
## [1] 0
#replace blank with NA
dat[dat == ""] <- NA
sum(is.na(dat)) #Total amount of NAs after replacing blank cells with NAs
## [1] 0

There are no NA nor blanks

summary(dat)
##        X           Class               Sex                Age           
##  Min.   :   1   Length:2201        Length:2201        Length:2201       
##  1st Qu.: 551   Class :character   Class :character   Class :character  
##  Median :1101   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1101                                                           
##  3rd Qu.:1651                                                           
##  Max.   :2201                                                           
##    Survived        
##  Length:2201       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Need to convert Class, Sex, Aged and Survived into factors for further analysis. Also this can pick up bad data entries if any.

dat$Class <- factor(dat$Class)
dat$Sex <- factor(dat$Sex)
dat$Survived <- factor(dat$Survived)
dat$Age <- factor(dat$Age)
summary(dat)
##        X         Class         Sex          Age       Survived  
##  Min.   :   1   1st :325   Female: 470   Adult:2092   No :1490  
##  1st Qu.: 551   2nd :285   Male  :1731   Child: 109   Yes: 711  
##  Median :1101   3rd :706                                        
##  Mean   :1101   Crew:885                                        
##  3rd Qu.:1651                                                   
##  Max.   :2201
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
class(dat)
## [1] "data.frame"
dat2 <- dat[,-1]
dat2 <- as(dat2, "transactions")
inspect(dat2[1:10])
##      items                                      transactionID
## [1]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 1            
## [2]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 2            
## [3]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 3            
## [4]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 4            
## [5]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 5            
## [6]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 6            
## [7]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 7            
## [8]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 8            
## [9]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 9            
## [10] {Class=3rd,Sex=Male,Age=Child,Survived=No} 10
survied_rules = apriori(dat2, parameter = list(minlen=2,supp=0.005,conf=0.8), appearance = list(default = "lhs", rhs ="Survived=Yes"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.005      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 11 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[10 item(s), 2201 transaction(s)] done [0.00s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(survied_rules)
##     lhs                                  rhs            support     confidence
## [1] {Class=2nd,Age=Child}             => {Survived=Yes} 0.010904134 1.0000000 
## [2] {Class=2nd,Sex=Female}            => {Survived=Yes} 0.042253521 0.8773585 
## [3] {Class=1st,Sex=Female}            => {Survived=Yes} 0.064061790 0.9724138 
## [4] {Class=Crew,Sex=Female}           => {Survived=Yes} 0.009086779 0.8695652 
## [5] {Class=2nd,Sex=Female,Age=Child}  => {Survived=Yes} 0.005906406 1.0000000 
## [6] {Class=2nd,Sex=Female,Age=Adult}  => {Survived=Yes} 0.036347115 0.8602151 
## [7] {Class=1st,Sex=Female,Age=Adult}  => {Survived=Yes} 0.063607451 0.9722222 
## [8] {Class=Crew,Sex=Female,Age=Adult} => {Survived=Yes} 0.009086779 0.8695652 
##     coverage    lift     count
## [1] 0.010904134 3.095640  24  
## [2] 0.048159927 2.715986  93  
## [3] 0.065879146 3.010243 141  
## [4] 0.010449796 2.691861  20  
## [5] 0.005906406 3.095640  13  
## [6] 0.042253521 2.662916  80  
## [7] 0.065424807 3.009650 140  
## [8] 0.010449796 2.691861  20

Sort rules based on lift: rules_lift <- sort (rules, by=“lift”, decreasing=TRUE) # ‘high-lift’ rules.

rules_lift <- sort (survied_rules, by="lift", decreasing=TRUE) # 'high-lift' rules.
inspect(rules_lift)
##     lhs                                  rhs            support     confidence
## [1] {Class=2nd,Age=Child}             => {Survived=Yes} 0.010904134 1.0000000 
## [2] {Class=2nd,Sex=Female,Age=Child}  => {Survived=Yes} 0.005906406 1.0000000 
## [3] {Class=1st,Sex=Female}            => {Survived=Yes} 0.064061790 0.9724138 
## [4] {Class=1st,Sex=Female,Age=Adult}  => {Survived=Yes} 0.063607451 0.9722222 
## [5] {Class=2nd,Sex=Female}            => {Survived=Yes} 0.042253521 0.8773585 
## [6] {Class=Crew,Sex=Female}           => {Survived=Yes} 0.009086779 0.8695652 
## [7] {Class=Crew,Sex=Female,Age=Adult} => {Survived=Yes} 0.009086779 0.8695652 
## [8] {Class=2nd,Sex=Female,Age=Adult}  => {Survived=Yes} 0.036347115 0.8602151 
##     coverage    lift     count
## [1] 0.010904134 3.095640  24  
## [2] 0.005906406 3.095640  13  
## [3] 0.065879146 3.010243 141  
## [4] 0.065424807 3.009650 140  
## [5] 0.048159927 2.715986  93  
## [6] 0.010449796 2.691861  20  
## [7] 0.010449796 2.691861  20  
## [8] 0.042253521 2.662916  80

Above are the pairs of entries with highest lift values

# Top 3 are as follow:
inspect(rules_lift[1:3])
##     lhs                                 rhs            support     confidence
## [1] {Class=2nd,Age=Child}            => {Survived=Yes} 0.010904134 1.0000000 
## [2] {Class=2nd,Sex=Female,Age=Child} => {Survived=Yes} 0.005906406 1.0000000 
## [3] {Class=1st,Sex=Female}           => {Survived=Yes} 0.064061790 0.9724138 
##     coverage    lift     count
## [1] 0.010904134 3.095640  24  
## [2] 0.005906406 3.095640  13  
## [3] 0.065879146 3.010243 141
plot(rules_lift[1:3], method ="graph")

Question: Based on the generated rules, is there any difference between “Children” travelling first class, second class, and third class? Support your answer with an appropriate argument

Based on the rules that has highest lift values, there are not enough information about Children in first class and third class to compare.

survied_rules_class = apriori(dat2, parameter = list(minlen=2,supp=0.001,conf=0.3), appearance = list(lhs = c("Class=1st", "Class=2nd","Class=3rd" , "Age=Child"), rhs ="Survived=Yes"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 2 
## 
## set item appearances ...[5 item(s)] done [0.00s].
## set transactions ...[5 item(s), 2201 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [6 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules_lift_class <- sort (survied_rules_class, by="lift", decreasing=TRUE) # 'high-lift' rules.
inspect(rules_lift_class)
##     lhs                      rhs            support     confidence coverage   
## [1] {Class=2nd,Age=Child} => {Survived=Yes} 0.010904134 1.0000000  0.010904134
## [2] {Class=1st,Age=Child} => {Survived=Yes} 0.002726034 1.0000000  0.002726034
## [3] {Class=1st}           => {Survived=Yes} 0.092230804 0.6246154  0.147660154
## [4] {Age=Child}           => {Survived=Yes} 0.025897319 0.5229358  0.049522944
## [5] {Class=2nd}           => {Survived=Yes} 0.053611995 0.4140351  0.129486597
## [6] {Class=3rd,Age=Child} => {Survived=Yes} 0.012267151 0.3417722  0.035892776
##     lift     count
## [1] 3.095640  24  
## [2] 3.095640   6  
## [3] 1.933584 203  
## [4] 1.618821  57  
## [5] 1.281704 118  
## [6] 1.058004  27

The reason why 1st class and 3rd class children didn’t show up in survied_rules is because the support and confident level was set too high then. Here we can see that 3rd class children have support of 0.01 and confidence of 0.34 so we needed to lower the support to get the rules for this demographics.

inspect(rules_lift_class[c(1,2,6)])
##     lhs                      rhs            support     confidence coverage   
## [1] {Class=2nd,Age=Child} => {Survived=Yes} 0.010904134 1.0000000  0.010904134
## [2] {Class=1st,Age=Child} => {Survived=Yes} 0.002726034 1.0000000  0.002726034
## [3] {Class=3rd,Age=Child} => {Survived=Yes} 0.012267151 0.3417722  0.035892776
##     lift     count
## [1] 3.095640 24   
## [2] 3.095640  6   
## [3] 1.058004 27
plot(rules_lift_class[c(1,2,6)], method ="graph")

Based on the rules and the graph, 1st class and 2nd class children are 3 times more likely to survive than 3rd class children (lift of 3.09 vs lift of 1.05). The confidence of 1 for children in 1st class and 2nd class showed that 100% of these demographics survived whereas confidence of 0.34 for 3rd class children means that only 34% of these demographics survived. In other words, 3rd class children are 3 times less likely to survive than those on 1st and 2nd class.

The size of the circle indicates that there are a lot more children on 2nd class than first class who survived. Based on the support, 1% of the population are 2nd class children (all survived) whereas 0.2% of the population are 1st class children (all survived). 1.2% of the population is 3rd class children who survived but only a bit more than 30% of all 3rd class children survived. In other words, the majority of 3rd class children did not survived.

Explore more rules

#Set the support and confidence low to get more rules
survied_rules_explore = apriori(dat2, parameter = list(minlen=2,supp=0.06,conf=0.2), appearance = list(default="lhs", rhs ="Survived=Yes"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5    0.06      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 132 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[10 item(s), 2201 transaction(s)] done [0.00s].
## sorting and recoding items ... [9 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(survied_rules_explore)
##      lhs                                 rhs            support    confidence
## [1]  {Class=1st}                      => {Survived=Yes} 0.09223080 0.6246154 
## [2]  {Sex=Female}                     => {Survived=Yes} 0.15629259 0.7319149 
## [3]  {Class=3rd}                      => {Survived=Yes} 0.08087233 0.2521246 
## [4]  {Class=Crew}                     => {Survived=Yes} 0.09631985 0.2395480 
## [5]  {Sex=Male}                       => {Survived=Yes} 0.16674239 0.2120162 
## [6]  {Age=Adult}                      => {Survived=Yes} 0.29713766 0.3126195 
## [7]  {Class=1st,Sex=Female}           => {Survived=Yes} 0.06406179 0.9724138 
## [8]  {Class=1st,Age=Adult}            => {Survived=Yes} 0.08950477 0.6175549 
## [9]  {Sex=Female,Age=Adult}           => {Survived=Yes} 0.14357110 0.7435294 
## [10] {Class=3rd,Age=Adult}            => {Survived=Yes} 0.06860518 0.2408293 
## [11] {Class=Crew,Sex=Male}            => {Survived=Yes} 0.08723308 0.2227378 
## [12] {Class=Crew,Age=Adult}           => {Survived=Yes} 0.09631985 0.2395480 
## [13] {Sex=Male,Age=Adult}             => {Survived=Yes} 0.15356656 0.2027594 
## [14] {Class=1st,Sex=Female,Age=Adult} => {Survived=Yes} 0.06360745 0.9722222 
## [15] {Class=Crew,Sex=Male,Age=Adult}  => {Survived=Yes} 0.08723308 0.2227378 
##      coverage   lift      count
## [1]  0.14766015 1.9335843 203  
## [2]  0.21353930 2.2657450 344  
## [3]  0.32076329 0.7804871 178  
## [4]  0.40208996 0.7415544 212  
## [5]  0.78646070 0.6563257 367  
## [6]  0.95047706 0.9677574 654  
## [7]  0.06587915 3.0102430 141  
## [8]  0.14493412 1.9117275 197  
## [9]  0.19309405 2.3016993 316  
## [10] 0.28487051 0.7455209 151  
## [11] 0.39164016 0.6895161 192  
## [12] 0.40208996 0.7415544 212  
## [13] 0.75738301 0.6276702 338  
## [14] 0.06542481 3.0096499 140  
## [15] 0.39164016 0.6895161 192

711 out of 1490 who survived so about 1/3 of the population survived. So in general, an average persion has about 33% chance of surviving the titanic and lift for the general population to survive is 1.

Female passengers are 2 times more likely to survive compared to the general population (lift =2.2). They also have a much higher chance of survival compared to 3rd class and male passengers ( lift 2.2 > 0.78 > 0.65).

Third class passengers are less likely to survive compared to the general population(lift =0.78 vs lift = 1). However 3rd class passengers have little more chance to survive compared to male passenagers (0.78 vs 0.65 lift).

Only 23% of crew survived and they are less likely to survive compared to the general population. They are about 25% less likely to survive compared to the general population (lift 0.74 vs lift 1) and they are only a little bit more likely to survive than male passengers (lift 0.74 vs lift 0.65).

Male passengers’ survival chances are low (lift only 0.65) so lower than the general population (0.65 vs 1), a little bit lower than 3rd class passenger (0.65 vs 0.78) and a lot lower than their female counterpart (0.65 vs 2.2).

First class passengers are almost twice more likely to survive compared to the general population (lift 1.9 vs lift 1). They are a little bit less likely to survive compared to women (lift 1.9 vs lift 2.2), however, almost 3 times more likely to survive compared to male passengers (lift 1.9 vs lift 0.65).

If you’re a first class female adult, your chances of surviving is nearly 100% (confidence =0.97). There are only about 6.3% (support = 0.063) of 1st class female adults on the ship. Your chances of surviving would be 3 times higher than the general population (lift 3 vs 1) and also higher than the general female demographics (lift 3 vs 2.2).

======= Assignment 2

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.0.5
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.0.5
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(PerformanceAnalytics)
## Warning: package 'PerformanceAnalytics' was built under R version 4.0.5
## Loading required package: xts
## Warning: package 'xts' was built under R version 4.0.5
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
library(corrplot)
## corrplot 0.90 loaded
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.0.5
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.0     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## v purrr   0.3.4
## Warning: package 'readr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter()    masks stats::filter()
## x xts::first()       masks dplyr::first()
## x dplyr::lag()       masks stats::lag()
## x xts::last()        masks dplyr::last()
## x Hmisc::src()       masks dplyr::src()
## x Hmisc::summarize() masks dplyr::summarize()
library(datasets)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
#Get your current working directory in R using **getwd ()** function
#getwd()
dir=getwd()
#Set your working  directory in R using **setwd ()** function
#setwd ()
setwd(dir)

#Task 1 Missing values

dat <- read.csv("Movie.csv", header=TRUE)
names(dat)
##  [1] "Color"                     "Director"                 
##  [3] "Reviews"                   "Duration"                 
##  [5] "Director_facebook_likes"   "Actor_3_facebook_likes"   
##  [7] "Actor_2_name"              "Actor_1_facebook_likes"   
##  [9] "Gross"                     "Genre"                    
## [11] "Actor_1_name"              "Title"                    
## [13] "Votes"                     "Cast_total_facebook_likes"
## [15] "Actor_3_name"              "Facenumber_in_poster"     
## [17] "Plot_keywords"             "Movie_imdb_link"          
## [19] "Language"                  "Content_rating"           
## [21] "Budget"                    "Year"                     
## [23] "Actor_2_facebook_likes"    "Imdb_score"               
## [25] "Aspect_ratio"              "Movie_facebook_likes"
str(dat)
## 'data.frame':    3891 obs. of  26 variables:
##  $ Color                    : chr  "Color" "Color" "Color" "Color" ...
##  $ Director                 : chr  "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
##  $ Reviews                  : int  723 302 602 813 462 392 324 635 375 673 ...
##  $ Duration                 : int  178 169 148 164 132 156 100 141 153 183 ...
##  $ Director_facebook_likes  : int  0 563 0 22000 475 0 15 0 282 0 ...
##  $ Actor_3_facebook_likes   : int  855 1000 161 23000 530 4000 284 19000 10000 2000 ...
##  $ Actor_2_name             : chr  "Joel David Moore" "Orlando Bloom" "Rory Kinnear" "Christian Bale" ...
##  $ Actor_1_facebook_likes   : int  1000 40000 11000 27000 640 24000 799 26000 25000 15000 ...
##  $ Gross                    : int  760505847 309404152 200074175 448130642 73058679 336530303 200807262 458991599 301956980 330249062 ...
##  $ Genre                    : chr  "Action" "Action" "Action" "Action" ...
##  $ Actor_1_name             : chr  "CCH Pounder" "Johnny Depp" "Christoph Waltz" "Tom Hardy" ...
##  $ Title                    : chr  "Avatara" "Pirates of the Caribbean: At World's Enda" "Spectrea" "The Dark Knight Risesa" ...
##  $ Votes                    : int  886204 471220 275868 1144337 212204 383056 294810 462669 321795 371639 ...
##  $ Cast_total_facebook_likes: int  2791 46563 11554 95000 2277 39000 1651 66000 46282 21000 ...
##  $ Actor_3_name             : chr  "Wes Studi" "Jack Davenport" "Stephanie Sigman" "Joseph Gordon-Levitt" ...
##  $ Facenumber_in_poster     : int  0 0 1 0 1 0 1 4 3 0 ...
##  $ Plot_keywords            : chr  "avatar|future|marine|native|paraplegic" "goddess|marriage ceremony|marriage proposal|pirate|singapore" "bomb|espionage|sequel|spy|terrorist" "deception|imprisonment|lawlessness|police officer|terrorist plot" ...
##  $ Movie_imdb_link          : chr  "http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1" ...
##  $ Language                 : chr  "English" "English" "English" "English" ...
##  $ Content_rating           : chr  "PG-13" "" "PG-13" "PG-13" ...
##  $ Budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 2.64e+08 ...
##  $ Year                     : int  2009 2007 2015 2012 2012 2007 2010 2015 2009 2016 ...
##  $ Actor_2_facebook_likes   : int  936 5000 393 23000 632 11000 553 21000 11000 4000 ...
##  $ Imdb_score               : num  7.9 7.1 6.8 8.5 6.6 6.2 7.8 7.5 7.5 6.9 ...
##  $ Aspect_ratio             : num  1.78 2.35 2.35 2.35 2.35 2.35 1.85 2.35 2.35 2.35 ...
##  $ Movie_facebook_likes     : int  33000 0 85000 164000 24000 0 29000 118000 10000 197000 ...
sum(is.na(dat)) #Total amount of NAs before replacing blank cells with NAs
## [1] 101
#replace blank with NA
dat[dat == ""] <- NA
sum(is.na(dat)) #Total amount of NAs after replacing blank cells with NAs
## [1] 226

Number of NA in each column

colnames(dat)[colSums(is.na(dat)) > 0] #Names of columns that have NA
##  [1] "Color"                  "Reviews"                "Duration"              
##  [4] "Actor_3_facebook_likes" "Actor_2_name"           "Actor_1_facebook_likes"
##  [7] "Actor_1_name"           "Actor_3_name"           "Facenumber_in_poster"  
## [10] "Plot_keywords"          "Content_rating"         "Actor_2_facebook_likes"
## [13] "Aspect_ratio"
col_NA <- colSums(is.na(dat)) #The amount of NA each column has
col_NA
##                     Color                  Director                   Reviews 
##                         2                         0                         1 
##                  Duration   Director_facebook_likes    Actor_3_facebook_likes 
##                         1                         0                        10 
##              Actor_2_name    Actor_1_facebook_likes                     Gross 
##                         5                         3                         0 
##                     Genre              Actor_1_name                     Title 
##                         0                         3                         0 
##                     Votes Cast_total_facebook_likes              Actor_3_name 
##                         0                         0                        10 
##      Facenumber_in_poster             Plot_keywords           Movie_imdb_link 
##                         6                        31                         0 
##                  Language            Content_rating                    Budget 
##                         0                        74                         0 
##                      Year    Actor_2_facebook_likes                Imdb_score 
##                         0                         5                         0 
##              Aspect_ratio      Movie_facebook_likes 
##                        75                         0
class(col_NA)
## [1] "numeric"

Look at how many NAs for columns that have more than 0 NA and calculate the % of NA per variable.

for (e in 1:ncol(dat)){
  if (colSums(is.na(dat[,e, drop = FALSE]))>0) {print(paste("Column ", colnames(dat)[e], " has ", colSums(is.na(dat[,e, drop = FALSE])), " NA ", " which is only ","(", colSums(is.na(dat[,e, drop = FALSE])),"/3891)*100 = ", round((colSums(is.na(dat[,e, drop = FALSE]))/3891)*100, digits = 2), "% of the variable.", sep = ""))
    }
}
## [1] "Column Color has 2 NA  which is only (2/3891)*100 = 0.05% of the variable."
## [1] "Column Reviews has 1 NA  which is only (1/3891)*100 = 0.03% of the variable."
## [1] "Column Duration has 1 NA  which is only (1/3891)*100 = 0.03% of the variable."
## [1] "Column Actor_3_facebook_likes has 10 NA  which is only (10/3891)*100 = 0.26% of the variable."
## [1] "Column Actor_2_name has 5 NA  which is only (5/3891)*100 = 0.13% of the variable."
## [1] "Column Actor_1_facebook_likes has 3 NA  which is only (3/3891)*100 = 0.08% of the variable."
## [1] "Column Actor_1_name has 3 NA  which is only (3/3891)*100 = 0.08% of the variable."
## [1] "Column Actor_3_name has 10 NA  which is only (10/3891)*100 = 0.26% of the variable."
## [1] "Column Facenumber_in_poster has 6 NA  which is only (6/3891)*100 = 0.15% of the variable."
## [1] "Column Plot_keywords has 31 NA  which is only (31/3891)*100 = 0.8% of the variable."
## [1] "Column Content_rating has 74 NA  which is only (74/3891)*100 = 1.9% of the variable."
## [1] "Column Actor_2_facebook_likes has 5 NA  which is only (5/3891)*100 = 0.13% of the variable."
## [1] "Column Aspect_ratio has 75 NA  which is only (75/3891)*100 = 1.93% of the variable."

The amount of NAs in the columns that have NAs is minuscule. Except for column Aspect_ratio and Content_rating that has significantly higher NAs than other columns. However, at 1.9% NAs, it is still very small. Therefore no columns should be deleted due to the amount of NAs.

Look at how many NAs for rows that have more than 0 NA.

a <-0
for (e in 1:nrow(dat)){
  if (rowSums(is.na(dat[e,]))>0) {print(paste("observation", e, "has", rowSums(is.na(dat[e,])), "NA"))
    a <- a + 1
    }
}
## [1] "observation 2 has 1 NA"
## [1] "observation 18 has 1 NA"
## [1] "observation 53 has 1 NA"
## [1] "observation 87 has 1 NA"
## [1] "observation 96 has 1 NA"
## [1] "observation 125 has 1 NA"
## [1] "observation 153 has 1 NA"
## [1] "observation 164 has 1 NA"
## [1] "observation 198 has 1 NA"
## [1] "observation 237 has 1 NA"
## [1] "observation 240 has 1 NA"
## [1] "observation 265 has 1 NA"
## [1] "observation 272 has 1 NA"
## [1] "observation 306 has 1 NA"
## [1] "observation 360 has 1 NA"
## [1] "observation 370 has 1 NA"
## [1] "observation 408 has 1 NA"
## [1] "observation 443 has 1 NA"
## [1] "observation 494 has 1 NA"
## [1] "observation 496 has 1 NA"
## [1] "observation 542 has 1 NA"
## [1] "observation 557 has 1 NA"
## [1] "observation 561 has 1 NA"
## [1] "observation 599 has 1 NA"
## [1] "observation 673 has 1 NA"
## [1] "observation 907 has 1 NA"
## [1] "observation 1014 has 1 NA"
## [1] "observation 1260 has 1 NA"
## [1] "observation 1438 has 2 NA"
## [1] "observation 1667 has 2 NA"
## [1] "observation 1816 has 2 NA"
## [1] "observation 1869 has 1 NA"
## [1] "observation 1897 has 1 NA"
## [1] "observation 2023 has 1 NA"
## [1] "observation 2074 has 1 NA"
## [1] "observation 2137 has 1 NA"
## [1] "observation 2159 has 1 NA"
## [1] "observation 2164 has 1 NA"
## [1] "observation 2309 has 1 NA"
## [1] "observation 2341 has 1 NA"
## [1] "observation 2593 has 1 NA"
## [1] "observation 2619 has 3 NA"
## [1] "observation 2721 has 1 NA"
## [1] "observation 2722 has 1 NA"
## [1] "observation 2725 has 1 NA"
## [1] "observation 2773 has 1 NA"
## [1] "observation 2860 has 1 NA"
## [1] "observation 2890 has 1 NA"
## [1] "observation 2897 has 2 NA"
## [1] "observation 2905 has 1 NA"
## [1] "observation 2914 has 2 NA"
## [1] "observation 2959 has 4 NA"
## [1] "observation 3014 has 1 NA"
## [1] "observation 3028 has 1 NA"
## [1] "observation 3030 has 2 NA"
## [1] "observation 3034 has 2 NA"
## [1] "observation 3070 has 1 NA"
## [1] "observation 3079 has 1 NA"
## [1] "observation 3093 has 2 NA"
## [1] "observation 3101 has 1 NA"
## [1] "observation 3123 has 1 NA"
## [1] "observation 3156 has 3 NA"
## [1] "observation 3176 has 1 NA"
## [1] "observation 3182 has 1 NA"
## [1] "observation 3255 has 1 NA"
## [1] "observation 3257 has 2 NA"
## [1] "observation 3266 has 1 NA"
## [1] "observation 3280 has 1 NA"
## [1] "observation 3285 has 1 NA"
## [1] "observation 3288 has 1 NA"
## [1] "observation 3290 has 1 NA"
## [1] "observation 3306 has 1 NA"
## [1] "observation 3315 has 2 NA"
## [1] "observation 3318 has 1 NA"
## [1] "observation 3364 has 1 NA"
## [1] "observation 3367 has 1 NA"
## [1] "observation 3373 has 1 NA"
## [1] "observation 3401 has 1 NA"
## [1] "observation 3402 has 3 NA"
## [1] "observation 3411 has 1 NA"
## [1] "observation 3447 has 1 NA"
## [1] "observation 3453 has 2 NA"
## [1] "observation 3454 has 1 NA"
## [1] "observation 3459 has 2 NA"
## [1] "observation 3464 has 1 NA"
## [1] "observation 3468 has 1 NA"
## [1] "observation 3480 has 1 NA"
## [1] "observation 3496 has 2 NA"
## [1] "observation 3499 has 2 NA"
## [1] "observation 3511 has 2 NA"
## [1] "observation 3515 has 1 NA"
## [1] "observation 3518 has 1 NA"
## [1] "observation 3544 has 1 NA"
## [1] "observation 3553 has 1 NA"
## [1] "observation 3572 has 1 NA"
## [1] "observation 3573 has 1 NA"
## [1] "observation 3582 has 1 NA"
## [1] "observation 3584 has 1 NA"
## [1] "observation 3585 has 4 NA"
## [1] "observation 3591 has 1 NA"
## [1] "observation 3607 has 1 NA"
## [1] "observation 3608 has 2 NA"
## [1] "observation 3609 has 1 NA"
## [1] "observation 3622 has 1 NA"
## [1] "observation 3625 has 1 NA"
## [1] "observation 3627 has 1 NA"
## [1] "observation 3631 has 2 NA"
## [1] "observation 3643 has 7 NA"
## [1] "observation 3644 has 2 NA"
## [1] "observation 3645 has 3 NA"
## [1] "observation 3648 has 1 NA"
## [1] "observation 3650 has 1 NA"
## [1] "observation 3681 has 1 NA"
## [1] "observation 3686 has 1 NA"
## [1] "observation 3694 has 1 NA"
## [1] "observation 3696 has 2 NA"
## [1] "observation 3714 has 1 NA"
## [1] "observation 3724 has 2 NA"
## [1] "observation 3726 has 1 NA"
## [1] "observation 3740 has 1 NA"
## [1] "observation 3741 has 1 NA"
## [1] "observation 3746 has 2 NA"
## [1] "observation 3749 has 8 NA"
## [1] "observation 3757 has 1 NA"
## [1] "observation 3761 has 1 NA"
## [1] "observation 3768 has 2 NA"
## [1] "observation 3770 has 1 NA"
## [1] "observation 3777 has 1 NA"
## [1] "observation 3780 has 2 NA"
## [1] "observation 3788 has 1 NA"
## [1] "observation 3794 has 1 NA"
## [1] "observation 3796 has 1 NA"
## [1] "observation 3799 has 1 NA"
## [1] "observation 3802 has 6 NA"
## [1] "observation 3812 has 7 NA"
## [1] "observation 3813 has 1 NA"
## [1] "observation 3814 has 1 NA"
## [1] "observation 3826 has 1 NA"
## [1] "observation 3830 has 1 NA"
## [1] "observation 3832 has 1 NA"
## [1] "observation 3838 has 1 NA"
## [1] "observation 3839 has 1 NA"
## [1] "observation 3840 has 1 NA"
## [1] "observation 3841 has 2 NA"
## [1] "observation 3842 has 1 NA"
## [1] "observation 3852 has 2 NA"
## [1] "observation 3855 has 3 NA"
## [1] "observation 3856 has 3 NA"
## [1] "observation 3858 has 1 NA"
## [1] "observation 3861 has 1 NA"
## [1] "observation 3875 has 1 NA"
## [1] "observation 3877 has 2 NA"
## [1] "observation 3881 has 1 NA"
## [1] "observation 3882 has 2 NA"
## [1] "observation 3886 has 1 NA"
## [1] "observation 3888 has 1 NA"
## [1] "observation 3890 has 1 NA"
print(paste(a,"out of 3891 obversations has NA", ". Which is only", round((a/3891)*100,digits = 2), "% of the data"))
## [1] "157 out of 3891 obversations has NA . Which is only 4.03 % of the data"

There are 26 columns/variables. The amount of NAs for observations that have NAs is very small (ranging from 1-8 NAs). Since there are 3891 observations in this dataset, we can safely remove all variables that has NAs which is only about 4% of the data set without introducing bias.

dat.cleaned <- na.omit(dat)
any(is.na(dat.cleaned))
## [1] FALSE
dim(dat.cleaned)
## [1] 3734   26

After removing 157 observations with NAs, there are now 3734 obs left which is still a large amount of dataset for unbias correlation exploration.

#Task 2 Exploration: #Based on the Dataset, calculate “Profit” and determine the relationship between “Profit” and other variables (e.g. IMDB score). Hint: Profit = Gross – Budget. Use line plot or scatter plot to find the relationship.

There are no NAs in variable gross nor budget so therefore we can create a column profit as usually.

dat2 <- dat.cleaned
dat2$Profit <- dat2$Gross - dat2$Budget
head(dat2)
##   Color          Director Reviews Duration Director_facebook_likes
## 1 Color     James Cameron     723      178                       0
## 3 Color        Sam Mendes     602      148                       0
## 4 Color Christopher Nolan     813      164                   22000
## 5 Color    Andrew Stanton     462      132                     475
## 6 Color         Sam Raimi     392      156                       0
## 7 Color      Nathan Greno     324      100                      15
##   Actor_3_facebook_likes     Actor_2_name Actor_1_facebook_likes     Gross
## 1                    855 Joel David Moore                   1000 760505847
## 3                    161     Rory Kinnear                  11000 200074175
## 4                  23000   Christian Bale                  27000 448130642
## 5                    530  Samantha Morton                    640  73058679
## 6                   4000     James Franco                  24000 336530303
## 7                    284     Donna Murphy                    799 200807262
##       Genre    Actor_1_name                  Title   Votes
## 1    Action     CCH Pounder                Avatara  886204
## 3    Action Christoph Waltz               Spectrea  275868
## 4    Action       Tom Hardy The Dark Knight Risesa 1144337
## 5    Action    Daryl Sabara           John Cartera  212204
## 6    Action    J.K. Simmons          Spider-Man 3a  383056
## 7 Adventure    Brad Garrett               Tangleda  294810
##   Cast_total_facebook_likes         Actor_3_name Facenumber_in_poster
## 1                      2791            Wes Studi                    0
## 3                     11554     Stephanie Sigman                    1
## 4                     95000 Joseph Gordon-Levitt                    0
## 5                      2277         Polly Walker                    1
## 6                     39000        Kirsten Dunst                    0
## 7                      1651          M.C. Gainey                    1
##                                                      Plot_keywords
## 1                           avatar|future|marine|native|paraplegic
## 3                              bomb|espionage|sequel|spy|terrorist
## 4 deception|imprisonment|lawlessness|police officer|terrorist plot
## 5               alien|american civil war|male nipple|mars|princess
## 6                        sandman|spider man|symbiote|venom|villain
## 7             17th century|based on fairy tale|disney|flower|tower
##                                        Movie_imdb_link Language Content_rating
## 1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1  English          PG-13
## 3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1  English          PG-13
## 4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1  English          PG-13
## 5 http://www.imdb.com/title/tt0401729/?ref_=fn_tt_tt_1  English          PG-13
## 6 http://www.imdb.com/title/tt0413300/?ref_=fn_tt_tt_1  English          PG-13
## 7 http://www.imdb.com/title/tt0398286/?ref_=fn_tt_tt_1  English             PG
##      Budget Year Actor_2_facebook_likes Imdb_score Aspect_ratio
## 1 237000000 2009                    936        7.9         1.78
## 3 245000000 2015                    393        6.8         2.35
## 4 250000000 2012                  23000        8.5         2.35
## 5 263700000 2012                    632        6.6         2.35
## 6 258000000 2007                  11000        6.2         2.35
## 7 260000000 2010                    553        7.8         1.85
##   Movie_facebook_likes     Profit
## 1                33000  523505847
## 3                85000  -44925825
## 4               164000  198130642
## 5                24000 -190641321
## 6                    0   78530303
## 7                29000  -59192738

We will explore the relationship between profit and other comparable variables such as: Movie_facebook_likes, Aspect_ratio, Imdb_score, Actor_2_facebook_likes, Year, Budget, Content_rating, Facenumber_in_poster, Cast_total_facebook_likes, Votes, Genre, Gross, Actor_1_facebook_likes, Actor_3_facebook_likes, Director_facebook_likes, Duration, Reviews, Color.

#Profit vs Imdb_score
ggplot(data = dat2, mapping = aes(y = Profit, x = Imdb_score)) + geom_point(color = 'darkblue', alpha=1/5)

Since there are a few outliners of movies that made losses, it has made the scatter plot scale spread out and hard to visualise the correlation. For example, according to the plot, there was a movie that lost 12 billions dollars. It is questionable if this movie really made a loss of 12 billions dollars or was it a data error. Therefore, a new dataframe will be created to only take entries for only movies that made more than -200 millions dollars.

dat_profit <- dat2[dat2$Profit >= -2e8,]
head(dat_profit)
##   Color          Director Reviews Duration Director_facebook_likes
## 1 Color     James Cameron     723      178                       0
## 3 Color        Sam Mendes     602      148                       0
## 4 Color Christopher Nolan     813      164                   22000
## 5 Color    Andrew Stanton     462      132                     475
## 6 Color         Sam Raimi     392      156                       0
## 7 Color      Nathan Greno     324      100                      15
##   Actor_3_facebook_likes     Actor_2_name Actor_1_facebook_likes     Gross
## 1                    855 Joel David Moore                   1000 760505847
## 3                    161     Rory Kinnear                  11000 200074175
## 4                  23000   Christian Bale                  27000 448130642
## 5                    530  Samantha Morton                    640  73058679
## 6                   4000     James Franco                  24000 336530303
## 7                    284     Donna Murphy                    799 200807262
##       Genre    Actor_1_name                  Title   Votes
## 1    Action     CCH Pounder                Avatara  886204
## 3    Action Christoph Waltz               Spectrea  275868
## 4    Action       Tom Hardy The Dark Knight Risesa 1144337
## 5    Action    Daryl Sabara           John Cartera  212204
## 6    Action    J.K. Simmons          Spider-Man 3a  383056
## 7 Adventure    Brad Garrett               Tangleda  294810
##   Cast_total_facebook_likes         Actor_3_name Facenumber_in_poster
## 1                      2791            Wes Studi                    0
## 3                     11554     Stephanie Sigman                    1
## 4                     95000 Joseph Gordon-Levitt                    0
## 5                      2277         Polly Walker                    1
## 6                     39000        Kirsten Dunst                    0
## 7                      1651          M.C. Gainey                    1
##                                                      Plot_keywords
## 1                           avatar|future|marine|native|paraplegic
## 3                              bomb|espionage|sequel|spy|terrorist
## 4 deception|imprisonment|lawlessness|police officer|terrorist plot
## 5               alien|american civil war|male nipple|mars|princess
## 6                        sandman|spider man|symbiote|venom|villain
## 7             17th century|based on fairy tale|disney|flower|tower
##                                        Movie_imdb_link Language Content_rating
## 1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1  English          PG-13
## 3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1  English          PG-13
## 4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1  English          PG-13
## 5 http://www.imdb.com/title/tt0401729/?ref_=fn_tt_tt_1  English          PG-13
## 6 http://www.imdb.com/title/tt0413300/?ref_=fn_tt_tt_1  English          PG-13
## 7 http://www.imdb.com/title/tt0398286/?ref_=fn_tt_tt_1  English             PG
##      Budget Year Actor_2_facebook_likes Imdb_score Aspect_ratio
## 1 237000000 2009                    936        7.9         1.78
## 3 245000000 2015                    393        6.8         2.35
## 4 250000000 2012                  23000        8.5         2.35
## 5 263700000 2012                    632        6.6         2.35
## 6 258000000 2007                  11000        6.2         2.35
## 7 260000000 2010                    553        7.8         1.85
##   Movie_facebook_likes     Profit
## 1                33000  523505847
## 3                85000  -44925825
## 4               164000  198130642
## 5                24000 -190641321
## 6                    0   78530303
## 7                29000  -59192738
ggplot(data = dat_profit, mapping = aes(y = Profit, x = Imdb_score)) + geom_point(color = 'darkblue', alpha =1/5) +geom_smooth(se=FALSE, 
              method = "lm", 
              formula = y~poly(x,1), 
              size = 1.5)

There is a positive correlation between Imdb_score and profit. As Imdb_score increases, profit is predicted to increase also.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Movie_facebook_likes)) + geom_point(color = 'darkblue', alpha =1/5)+geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a positive correlation between Movie_facebook_likes and profit. As Movie_facebook_likes increases, profit is predicted to increase also. However, there is 1 outliner where Movie_facebook_likes is really high but profit is low. If it wasn’t for this outliner, the regression line may even show a higher positive correlation.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Aspect_ratio)) + geom_point(color = 'darkblue', alpha=1/5)+geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

According to the regression line, there seems to be a slight negative correlation between profit and Aspect_ratio. However, there is an outliner of Aspect_ratio being 16 with low profit, this could have skewed the regression line. Perhaps if it wasn’t for this outliner, the regression line could have been horizontal, hence no relationship between Profit and Aspect_ratio.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Actor_2_facebook_likes)) + geom_point(color = 'darkblue', alpha =1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a positive correlation between Actor_2_facebook_likes and Profit. As Actor_2_facebook_likes increases, profit seems to also increase. There is an outliner of really high Actor_2_facebook_likes but low profit. Perhaps if it wasn’t for this outliner, the regression line would have been steeper, hence, higher positive correlation.

ggplot(data = dat_profit, mapping = aes(x = Year, y = Profit)) + geom_point(color = 'darkblue', alpha =1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

Interestingly and unexpectedly, as Year increases, Profit seems to slightly decrease. There is a negative correlation between Year and Profit. There are many new movies that make really high profit but there are also many that made losses.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Content_rating)) + geom_point(color = 'darkblue', alpha =1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

In terms of Content_rating, PG-13 and PG have movies from losses and to high profit and some outliners with really high profit. R and G are spread out from losses to medium and high profit. Not Rated seem to make more losses than profit. Almost all aprroved rating seem to make profit but they are as high PG, PG-13, G, and R. The rest of the categories seem to not have as many movies.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Budget)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There seems to be a very small positive relationship between Profit and Budget. However, since budget is used to calculate profit, there should be a negative relationship between budget and profit if we hadn’t removed the all movies with profit less than $200,000,000.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Facenumber_in_poster)) + geom_point(color = 'darkblue', alpha =1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a slight negative correlation between Facenumber_in_poster and Profit. There are also 2 outliners where Facenumber_in_poster are high but Profit is low which could have skewed the regression line a little bit. If it wasn’t for these 2 outliners, may we find that the regression line could be a little bit flatter which could mean that there is no correlation between Facenumber_in_poster and Profit.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Cast_total_facebook_likes)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a positive correlation between Cast_total_facebook_likes and Profit. It is predicted that as Cast_total_facebook_likes increases, profit increases. If it wasn’t for the outliner in the far right, the positive correlation may even be stronger.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Votes)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a positive correlation between Votes and Profit. It is predicted that as Votes increases, profit increases.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Genre)) + geom_point(color = 'darkblue', alpha =1/5) +geom_smooth(method = "lm") + theme(axis.text.x = element_text(angle = 45, hjust=1))
## `geom_smooth()` using formula 'y ~ x'

It can be said that action movies have big budget and as predicted earlier, the bigger the budget, the larger the profit. This could explain why the profit can be amongst the highest categories. Many action movies are also big losers in terms of profit as well. Other popular categories where profit ranges from really low to really high are Adventure, Animation, Comedy, Drama. Middle range categories are Biography, Crime, Horror, Fantasy, Mystery. Animation is spread out between low and high profit. Suprising Romance, Thriller, Musical, Western and Family don’t do so well and fall in the range that barely make profit.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Gross)) + geom_point(color = 'darkblue', alpha =1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

As expected, the more money you make, the higher the profit. This shows strong positive correlation correlation.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Actor_1_facebook_likes)) + geom_point(color = 'darkblue', alpha =1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a very slight positive correlation between Actor_1_facebook_likes and Profit. As Actor_1_facebook_likes increases, Profit is expected to slightly increase. If it wasn’t for the far right outliner, the regression line might even be steeper.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Actor_3_facebook_likes)) + geom_point(color = 'darkblue', alpha = 1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

Actor_1_facebook_likes is sligher more important than Actor_3_facebook_likes and a lot of actors got no fb likes. There is a positive correlation between Actor_3_facebook_likes and Profit.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Director_facebook_likes)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a very low positive correlation between Director_facebook_likes and Profit, many directors with no to little fb likes.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Duration)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a low positive correlation between Duration and Profit. The majority of movies have duration between 50 mins to 150 mins. The longer the duration of a movie doesn’t mean higher the profit.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Reviews)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a medium positive correlation between Reviews and Profit. As reviews increases, Profit is predicted to increase.

ggplot(data = dat_profit, mapping = aes(y = Profit, x = Color)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(se=FALSE, 
              method = "lm", 
              formula = y~poly(x,1), 
              size = 1.5)

ggplot(data = dat_profit, mapping = aes(x = Profit, y = ..density..)) + geom_freqpoly(mapping = aes(colour = Color), binwidth = 500000)

Generally, Colar movies can make a lot more money than Black and White but they can also lose a lot more profit as well. There are more Color movies as supposed to Black and White. In the density plot, it can be seen that a large amount of black and white movies make little to no profit.

#Calculate the correlation between the variable(s) used in the Dataset.

#Select only numeric variables for correlation matrix
dat_num <- dat2 %>% select_if(is.numeric)
dat_cor<- cor(dat_num)
round(dat_cor, 2)
##                           Reviews Duration Director_facebook_likes
## Reviews                      1.00     0.23                    0.18
## Duration                     0.23     1.00                    0.18
## Director_facebook_likes      0.18     0.18                    1.00
## Actor_3_facebook_likes       0.25     0.13                    0.12
## Actor_1_facebook_likes       0.16     0.08                    0.09
## Gross                        0.46     0.24                    0.14
## Votes                        0.59     0.34                    0.30
## Cast_total_facebook_likes    0.25     0.14                    0.27
## Facenumber_in_poster        -0.03     0.03                   -0.05
## Budget                       0.10     0.07                    0.02
## Year                         0.42    -0.13                   -0.04
## Actor_2_facebook_likes       0.25     0.13                    0.12
## Imdb_score                   0.35     0.37                    0.19
## Aspect_ratio                 0.18     0.15                    0.04
## Movie_facebook_likes         0.70     0.22                    0.16
## Profit                       0.04     0.01                    0.02
##                           Actor_3_facebook_likes Actor_1_facebook_likes Gross
## Reviews                                     0.25                   0.16  0.46
## Duration                                    0.13                   0.08  0.24
## Director_facebook_likes                     0.12                   0.09  0.14
## Actor_3_facebook_likes                      1.00                   0.25  0.30
## Actor_1_facebook_likes                      0.25                   1.00  0.14
## Gross                                       0.30                   0.14  1.00
## Votes                                       0.27                   0.18  0.62
## Cast_total_facebook_likes                   0.45                   0.94  0.23
## Facenumber_in_poster                        0.10                   0.06 -0.03
## Budget                                      0.04                   0.02  0.10
## Year                                        0.12                   0.09  0.05
## Actor_2_facebook_likes                      0.55                   0.39  0.25
## Imdb_score                                  0.06                   0.09  0.22
## Aspect_ratio                                0.05                   0.06  0.06
## Movie_facebook_likes                        0.27                   0.13  0.37
## Profit                                      0.05                   0.03  0.21
##                           Votes Cast_total_facebook_likes Facenumber_in_poster
## Reviews                    0.59                      0.25                -0.03
## Duration                   0.34                      0.14                 0.03
## Director_facebook_likes    0.30                      0.27                -0.05
## Actor_3_facebook_likes     0.27                      0.45                 0.10
## Actor_1_facebook_likes     0.18                      0.94                 0.06
## Gross                      0.62                      0.23                -0.03
## Votes                      1.00                      0.28                -0.03
## Cast_total_facebook_likes  0.28                      1.00                 0.06
## Facenumber_in_poster      -0.03                      0.06                 1.00
## Budget                     0.06                      0.03                -0.02
## Year                       0.02                      0.11                 0.07
## Actor_2_facebook_likes     0.24                      0.63                 0.07
## Imdb_score                 0.48                      0.14                -0.07
## Aspect_ratio               0.08                      0.07                 0.02
## Movie_facebook_likes       0.52                      0.21                 0.02
## Profit                     0.13                      0.04                 0.01
##                           Budget  Year Actor_2_facebook_likes Imdb_score
## Reviews                     0.10  0.42                   0.25       0.35
## Duration                    0.07 -0.13                   0.13       0.37
## Director_facebook_likes     0.02 -0.04                   0.12       0.19
## Actor_3_facebook_likes      0.04  0.12                   0.55       0.06
## Actor_1_facebook_likes      0.02  0.09                   0.39       0.09
## Gross                       0.10  0.05                   0.25       0.22
## Votes                       0.06  0.02                   0.24       0.48
## Cast_total_facebook_likes   0.03  0.11                   0.63       0.14
## Facenumber_in_poster       -0.02  0.07                   0.07      -0.07
## Budget                      1.00  0.05                   0.03       0.03
## Year                        0.05  1.00                   0.12      -0.14
## Actor_2_facebook_likes      0.03  0.12                   1.00       0.10
## Imdb_score                  0.03 -0.14                   0.10       1.00
## Aspect_ratio                0.02  0.22                   0.06       0.03
## Movie_facebook_likes        0.05  0.31                   0.23       0.28
## Profit                     -0.95 -0.03                   0.04       0.04
##                           Aspect_ratio Movie_facebook_likes Profit
## Reviews                           0.18                 0.70   0.04
## Duration                          0.15                 0.22   0.01
## Director_facebook_likes           0.04                 0.16   0.02
## Actor_3_facebook_likes            0.05                 0.27   0.05
## Actor_1_facebook_likes            0.06                 0.13   0.03
## Gross                             0.06                 0.37   0.21
## Votes                             0.08                 0.52   0.13
## Cast_total_facebook_likes         0.07                 0.21   0.04
## Facenumber_in_poster              0.02                 0.02   0.01
## Budget                            0.02                 0.05  -0.95
## Year                              0.22                 0.31  -0.03
## Actor_2_facebook_likes            0.06                 0.23   0.04
## Imdb_score                        0.03                 0.28   0.04
## Aspect_ratio                      1.00                 0.11  -0.01
## Movie_facebook_likes              0.11                 1.00   0.06
## Profit                           -0.01                 0.06   1.00
corrplot(dat_cor, method = "pie", type = "upper", tl.pos="l")

We will use dat2 to plot for correlations that don’t involve profit as one of the variables. This way we can incldue all observations.

dat_profit will be used for most correlations that have profit as one of the variables because due to outliners, it would easier to visualise when we exclude those that make less than $200 millions. However, dat2 will be used for the correlation of budget and profit and the reason will be explained below.

#Strong correlations

Actor_1_facebook_likes has a strong positive correlation with Cast_total_facebook_likes with coefficient of 0.94.

ggplot(data = dat2, mapping = aes(y = Cast_total_facebook_likes, x = Actor_1_facebook_likes)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is a very strong positive correlation between Actor_1_facebook_likes and Cast_total_facebook_likes probably because Actor_1_facebook_likes is a large part of Cast_total_facebook_likes so as Actor_1_facebook_likes increases, Cast_total_facebook_likes is expected to also increase as well.

Reviews has a strong positive correlation with Movie_facebook_likes with coefficient of 0.7

ggplot(data = dat2, mapping = aes(x = Movie_facebook_likes, y = Reviews)) + geom_point(color = 'darkblue',alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

Reviews has a strong positive correlation with Movie_facebook_likes probably because if a movie has more reviews on IMDB, people will also look at that movie on fb and chances of that movie getting likes will also increase.

Reviews and Votes have positive correlation with coefficient 0.59

ggplot(data = dat2, mapping = aes(x = Votes, y = Reviews)) + geom_point(color = 'darkblue', alpha = 1/5) +geom_smooth(se=FALSE, 
              method = "lm", 
              formula = y~poly(x,1), 
              size = 1.5)

It is plausible to argue that the higher the votes are, the more reviews a movie can get because audience can have the tendency to select a movie to watch based on votes. After watching, they will review the movie.

Gross and Votes have positive correlation with coefficient being 0.62

ggplot(data = dat2, mapping = aes(x = Votes, y = Gross)) + geom_point(color = 'darkblue', alpha = 1/10) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

It could be said that the more Votes a movies, the more people would want to watch the movie and hence the higher Gross income.

Actor_2_facebook_likes and Cast_total_facebook_likes have positive correlation with coefficient being 0.63

ggplot(data = dat2, mapping = aes(x = Actor_2_facebook_likes, y = Cast_total_facebook_likes)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

Actor_2_facebook_likes contributes to Cast_total_facebook_likes so the positive correlation here is self-explanatory. However, there are 2 outliners. The first one has high Actor_2_facebook_likes but low Cast_total_facebook_likes, this could be because the actor is popular but the rest of the cast is not. The second one has high Cast_total_facebook_likes but low Actor_2_facebook_likes, this could be because the 2nd actor is not popular but the rest of the cast are more fans’ favorites.

Budget has a very strong negative correlation with Profit with coefficient of -0.95

ggplot(data = dat2, mapping = aes(x = Budget, y = Profit)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

dat2 is used here instead of dat_profit because we want to include all movies so we can get the most accurate regression line for this correlation when coefficient is -0.95. If we use dat_profit, the regression line will be slightly upwards which doesn’t represent coefficient of -0.95.

Profit is Gross - Budget so mathematically, the higher the Budget, the lower the Profit and hence the negative correlation, hence the steep downwards slope.

#Weak correlations There are many weak correlations, some of the weakly correlated pairs are:

Duration and Profit only have coefficient of 0.01

ggplot(data = dat_profit, mapping = aes(x = Duration, y = Profit)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

This is probably because the duration of the movie does not really determine how successful the movie is going to be. A long movie can have bad quality that leads to low profit.

Aspect_ratio and Profit only have coefficient of -0.01

ggplot(data = dat_profit, mapping = aes(x = Aspect_ratio, y = Profit)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

In addition to the explanation about profit and aspect_ratio above, the aspect_ratio is probably isn’t one of the main factors that make the audience fall in love with a movie, it’s probably more about the content of the movie. Therefore, aspect_ratio doesn’t have a direct impact on a movie’s profit.

Facenumber_in_poster and Profit only have coefficient of 0.01

ggplot(data = dat_profit, mapping = aes(x = Facenumber_in_poster, y = Profit)) + geom_point(color = 'darkblue',alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is probably isn’t a correlation here because the amount of faces appear in a poster doesn’t directly impact how successful a movie is. It is arguable that many bombshell movies have only 1 or 2 faces on the poster. Having too many faces on the poster could make tha poster forgetable and hence reduces the popularity of the movie which could also negatively impact a movie’s income.

Year and Votes only have coefficient of 0.01

ggplot(data = dat2, mapping = aes(x = Year, y = Votes)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

Newer movies do not necessarily have higher Votes than their older counterparts. There are many newer movies with low Votes as well. In other words, how old a movie is does not have an impact on the amount of Votes.

Budget and Actor_1_facebook_likes only have coefficient of 0.02

ggplot(data = dat2, mapping = aes(y = Actor_1_facebook_likes, x = Budget)) + geom_point(color = 'darkblue', alpha=1/5)+geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

A high budget movie does not mean that the main actor will get a lot of facebook likes given the fact that the budget is mainly used for the production of the movie, not to promote and PR the main actor’s private facebook page.

Aspect_ratio and Imdb_score only have coefficient of 0.03

ggplot(data = dat2, mapping = aes(x = Aspect_ratio, y= Imdb_score)) + geom_point(color = 'darkblue', alpha=1/5) +geom_smooth(method = "lm") 
## `geom_smooth()` using formula 'y ~ x'

There is an outliner of of aspect_ratio 16 that happen to have a decent Imdb_score, if it wasn’t for that, the regression line would probably be flatter. It is plausible to say that high aspect_ratio does not impact Imdb_score because Imdb_score is probably more influenced by the content, story line, quality of the movie rather than the aspect_ratio alone.

#Task 2

dat <- read.csv("Titanic.csv", header=TRUE)
names(dat)
## [1] "X"        "Class"    "Sex"      "Age"      "Survived"
str(dat)
## 'data.frame':    2201 obs. of  5 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Class   : chr  "3rd" "3rd" "3rd" "3rd" ...
##  $ Sex     : chr  "Male" "Male" "Male" "Male" ...
##  $ Age     : chr  "Child" "Child" "Child" "Child" ...
##  $ Survived: chr  "No" "No" "No" "No" ...
sum(is.na(dat)) #Total amount of NAs before replacing blank cells with NAs
## [1] 0
#replace blank with NA
dat[dat == ""] <- NA
sum(is.na(dat)) #Total amount of NAs after replacing blank cells with NAs
## [1] 0

There are no NA nor blanks

summary(dat)
##        X           Class               Sex                Age           
##  Min.   :   1   Length:2201        Length:2201        Length:2201       
##  1st Qu.: 551   Class :character   Class :character   Class :character  
##  Median :1101   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1101                                                           
##  3rd Qu.:1651                                                           
##  Max.   :2201                                                           
##    Survived        
##  Length:2201       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Need to convert Class, Sex, Aged and Survived into factors for further analysis. Also this can pick up bad data entries if any.

dat$Class <- factor(dat$Class)
dat$Sex <- factor(dat$Sex)
dat$Survived <- factor(dat$Survived)
dat$Age <- factor(dat$Age)
summary(dat)
##        X         Class         Sex          Age       Survived  
##  Min.   :   1   1st :325   Female: 470   Adult:2092   No :1490  
##  1st Qu.: 551   2nd :285   Male  :1731   Child: 109   Yes: 711  
##  Median :1101   3rd :706                                        
##  Mean   :1101   Crew:885                                        
##  3rd Qu.:1651                                                   
##  Max.   :2201
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
class(dat)
## [1] "data.frame"
dat2 <- dat[,-1]
dat2 <- as(dat2, "transactions")
inspect(dat2[1:10])
##      items                                      transactionID
## [1]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 1            
## [2]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 2            
## [3]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 3            
## [4]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 4            
## [5]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 5            
## [6]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 6            
## [7]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 7            
## [8]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 8            
## [9]  {Class=3rd,Sex=Male,Age=Child,Survived=No} 9            
## [10] {Class=3rd,Sex=Male,Age=Child,Survived=No} 10
survied_rules = apriori(dat2, parameter = list(minlen=2,supp=0.005,conf=0.8), appearance = list(default = "lhs", rhs ="Survived=Yes"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.005      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 11 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[10 item(s), 2201 transaction(s)] done [0.00s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(survied_rules)
##     lhs                                  rhs            support     confidence
## [1] {Class=2nd,Age=Child}             => {Survived=Yes} 0.010904134 1.0000000 
## [2] {Class=2nd,Sex=Female}            => {Survived=Yes} 0.042253521 0.8773585 
## [3] {Class=1st,Sex=Female}            => {Survived=Yes} 0.064061790 0.9724138 
## [4] {Class=Crew,Sex=Female}           => {Survived=Yes} 0.009086779 0.8695652 
## [5] {Class=2nd,Sex=Female,Age=Child}  => {Survived=Yes} 0.005906406 1.0000000 
## [6] {Class=2nd,Sex=Female,Age=Adult}  => {Survived=Yes} 0.036347115 0.8602151 
## [7] {Class=1st,Sex=Female,Age=Adult}  => {Survived=Yes} 0.063607451 0.9722222 
## [8] {Class=Crew,Sex=Female,Age=Adult} => {Survived=Yes} 0.009086779 0.8695652 
##     coverage    lift     count
## [1] 0.010904134 3.095640  24  
## [2] 0.048159927 2.715986  93  
## [3] 0.065879146 3.010243 141  
## [4] 0.010449796 2.691861  20  
## [5] 0.005906406 3.095640  13  
## [6] 0.042253521 2.662916  80  
## [7] 0.065424807 3.009650 140  
## [8] 0.010449796 2.691861  20

Sort rules based on lift: rules_lift <- sort (rules, by=“lift”, decreasing=TRUE) # ‘high-lift’ rules.

rules_lift <- sort (survied_rules, by="lift", decreasing=TRUE) # 'high-lift' rules.
inspect(rules_lift)
##     lhs                                  rhs            support     confidence
## [1] {Class=2nd,Age=Child}             => {Survived=Yes} 0.010904134 1.0000000 
## [2] {Class=2nd,Sex=Female,Age=Child}  => {Survived=Yes} 0.005906406 1.0000000 
## [3] {Class=1st,Sex=Female}            => {Survived=Yes} 0.064061790 0.9724138 
## [4] {Class=1st,Sex=Female,Age=Adult}  => {Survived=Yes} 0.063607451 0.9722222 
## [5] {Class=2nd,Sex=Female}            => {Survived=Yes} 0.042253521 0.8773585 
## [6] {Class=Crew,Sex=Female}           => {Survived=Yes} 0.009086779 0.8695652 
## [7] {Class=Crew,Sex=Female,Age=Adult} => {Survived=Yes} 0.009086779 0.8695652 
## [8] {Class=2nd,Sex=Female,Age=Adult}  => {Survived=Yes} 0.036347115 0.8602151 
##     coverage    lift     count
## [1] 0.010904134 3.095640  24  
## [2] 0.005906406 3.095640  13  
## [3] 0.065879146 3.010243 141  
## [4] 0.065424807 3.009650 140  
## [5] 0.048159927 2.715986  93  
## [6] 0.010449796 2.691861  20  
## [7] 0.010449796 2.691861  20  
## [8] 0.042253521 2.662916  80

Above are the pairs of entries with highest lift values

# Top 3 are as follow:
inspect(rules_lift[1:3])
##     lhs                                 rhs            support     confidence
## [1] {Class=2nd,Age=Child}            => {Survived=Yes} 0.010904134 1.0000000 
## [2] {Class=2nd,Sex=Female,Age=Child} => {Survived=Yes} 0.005906406 1.0000000 
## [3] {Class=1st,Sex=Female}           => {Survived=Yes} 0.064061790 0.9724138 
##     coverage    lift     count
## [1] 0.010904134 3.095640  24  
## [2] 0.005906406 3.095640  13  
## [3] 0.065879146 3.010243 141
plot(rules_lift[1:3], method ="graph")

Question: Based on the generated rules, is there any difference between “Children” travelling first class, second class, and third class? Support your answer with an appropriate argument

Based on the rules that has highest lift values, there are not enough information about Children in first class and third class to compare.

survied_rules_class = apriori(dat2, parameter = list(minlen=2,supp=0.001,conf=0.3), appearance = list(lhs = c("Class=1st", "Class=2nd","Class=3rd" , "Age=Child"), rhs ="Survived=Yes"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 2 
## 
## set item appearances ...[5 item(s)] done [0.00s].
## set transactions ...[5 item(s), 2201 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [6 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules_lift_class <- sort (survied_rules_class, by="lift", decreasing=TRUE) # 'high-lift' rules.
inspect(rules_lift_class)
##     lhs                      rhs            support     confidence coverage   
## [1] {Class=2nd,Age=Child} => {Survived=Yes} 0.010904134 1.0000000  0.010904134
## [2] {Class=1st,Age=Child} => {Survived=Yes} 0.002726034 1.0000000  0.002726034
## [3] {Class=1st}           => {Survived=Yes} 0.092230804 0.6246154  0.147660154
## [4] {Age=Child}           => {Survived=Yes} 0.025897319 0.5229358  0.049522944
## [5] {Class=2nd}           => {Survived=Yes} 0.053611995 0.4140351  0.129486597
## [6] {Class=3rd,Age=Child} => {Survived=Yes} 0.012267151 0.3417722  0.035892776
##     lift     count
## [1] 3.095640  24  
## [2] 3.095640   6  
## [3] 1.933584 203  
## [4] 1.618821  57  
## [5] 1.281704 118  
## [6] 1.058004  27

The reason why 1st class and 3rd class children didn’t show up in survied_rules is because the support and confident level was set too high then. Here we can see that 3rd class children have support of 0.01 and confidence of 0.34 so we needed to lower the support to get the rules for this demographics.

inspect(rules_lift_class[c(1,2,6)])
##     lhs                      rhs            support     confidence coverage   
## [1] {Class=2nd,Age=Child} => {Survived=Yes} 0.010904134 1.0000000  0.010904134
## [2] {Class=1st,Age=Child} => {Survived=Yes} 0.002726034 1.0000000  0.002726034
## [3] {Class=3rd,Age=Child} => {Survived=Yes} 0.012267151 0.3417722  0.035892776
##     lift     count
## [1] 3.095640 24   
## [2] 3.095640  6   
## [3] 1.058004 27
plot(rules_lift_class[c(1,2,6)], method ="graph")

Based on the rules and the graph, 1st class and 2nd class children are 3 times more likely to survive than 3rd class children (lift of 3.09 vs lift of 1.05). The confidence of 1 for children in 1st class and 2nd class showed that 100% of these demographics survived whereas confidence of 0.34 for 3rd class children means that only 34% of these demographics survived. In other words, 3rd class children are 3 times less likely to survive than those on 1st and 2nd class.

The size of the circle indicates that there are a lot more children on 2nd class than first class who survived. Based on the support, 1% of the population are 2nd class children (all survived) whereas 0.2% of the population are 1st class children (all survived). 1.2% of the population is 3rd class children who survived but only a bit more than 30% of all 3rd class children survived. In other words, the majority of 3rd class children did not survived.

Explore more rules

#Set the support and confidence low to get more rules
survied_rules_explore = apriori(dat2, parameter = list(minlen=2,supp=0.06,conf=0.2), appearance = list(default="lhs", rhs ="Survived=Yes"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5    0.06      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 132 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[10 item(s), 2201 transaction(s)] done [0.00s].
## sorting and recoding items ... [9 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(survied_rules_explore)
##      lhs                                 rhs            support    confidence
## [1]  {Class=1st}                      => {Survived=Yes} 0.09223080 0.6246154 
## [2]  {Sex=Female}                     => {Survived=Yes} 0.15629259 0.7319149 
## [3]  {Class=3rd}                      => {Survived=Yes} 0.08087233 0.2521246 
## [4]  {Class=Crew}                     => {Survived=Yes} 0.09631985 0.2395480 
## [5]  {Sex=Male}                       => {Survived=Yes} 0.16674239 0.2120162 
## [6]  {Age=Adult}                      => {Survived=Yes} 0.29713766 0.3126195 
## [7]  {Class=1st,Sex=Female}           => {Survived=Yes} 0.06406179 0.9724138 
## [8]  {Class=1st,Age=Adult}            => {Survived=Yes} 0.08950477 0.6175549 
## [9]  {Sex=Female,Age=Adult}           => {Survived=Yes} 0.14357110 0.7435294 
## [10] {Class=3rd,Age=Adult}            => {Survived=Yes} 0.06860518 0.2408293 
## [11] {Class=Crew,Sex=Male}            => {Survived=Yes} 0.08723308 0.2227378 
## [12] {Class=Crew,Age=Adult}           => {Survived=Yes} 0.09631985 0.2395480 
## [13] {Sex=Male,Age=Adult}             => {Survived=Yes} 0.15356656 0.2027594 
## [14] {Class=1st,Sex=Female,Age=Adult} => {Survived=Yes} 0.06360745 0.9722222 
## [15] {Class=Crew,Sex=Male,Age=Adult}  => {Survived=Yes} 0.08723308 0.2227378 
##      coverage   lift      count
## [1]  0.14766015 1.9335843 203  
## [2]  0.21353930 2.2657450 344  
## [3]  0.32076329 0.7804871 178  
## [4]  0.40208996 0.7415544 212  
## [5]  0.78646070 0.6563257 367  
## [6]  0.95047706 0.9677574 654  
## [7]  0.06587915 3.0102430 141  
## [8]  0.14493412 1.9117275 197  
## [9]  0.19309405 2.3016993 316  
## [10] 0.28487051 0.7455209 151  
## [11] 0.39164016 0.6895161 192  
## [12] 0.40208996 0.7415544 212  
## [13] 0.75738301 0.6276702 338  
## [14] 0.06542481 3.0096499 140  
## [15] 0.39164016 0.6895161 192

711 out of 1490 who survived so about 1/3 of the population survived. So in general, an average persion has about 33% chance of surviving the titanic and lift for the general population to survive is 1.

Female passengers are 2 times more likely to survive compared to the general population (lift =2.2). They also have a much higher chance of survival compared to 3rd class and male passengers ( lift 2.2 > 0.78 > 0.65).

Third class passengers are less likely to survive compared to the general population(lift =0.78 vs lift = 1). However 3rd class passengers have little more chance to survive compared to male passenagers (0.78 vs 0.65 lift).

Only 23% of crew survived and they are less likely to survive compared to the general population. They are about 25% less likely to survive compared to the general population (lift 0.74 vs lift 1) and they are only a little bit more likely to survive than male passengers (lift 0.74 vs lift 0.65).

Male passengers’ survival chances are low (lift only 0.65) so lower than the general population (0.65 vs 1), a little bit lower than 3rd class passenger (0.65 vs 0.78) and a lot lower than their female counterpart (0.65 vs 2.2).

First class passengers are almost twice more likely to survive compared to the general population (lift 1.9 vs lift 1). They are a little bit less likely to survive compared to women (lift 1.9 vs lift 2.2), however, almost 3 times more likely to survive compared to male passengers (lift 1.9 vs lift 0.65).

If you’re a first class female adult, your chances of surviving is nearly 100% (confidence =0.97). There are only about 6.3% (support = 0.063) of 1st class female adults on the ship. Your chances of surviving would be 3 times higher than the general population (lift 3 vs 1) and also higher than the general female demographics (lift 3 vs 2.2).

>>>>>>> ca3c49adc5f41b36a4263b89311b94664c9fc8e9